Aggregating processor free time for energy reduction

Even after carefully tuning the memory characteristics to the application properties and the processor speed, during the execution of real applications there are times when the processor stalls, waiting for data from the memory. Processor stall can be used to increase the throughput by temporarily switching to a different thread of execution, or reduce the power and energy consumption by temporarily switching the processor to low-power mode. However, any such technique has a performance overhead in terms of switching time. Even though over the execution of an application the processor is stalled for a considerable amount of time, each stall duration is too small to profitably perform any state switch. In this paper, we present code transformations to aggregate processor free time. Our experiments on the Intel XScale and Stream kernels show that up to 50,000 processor cycles can be aggregated, and used to profitably switch the processor to low-power mode. We further show that our code transformations can switch the processor to low-power mode for up to 75% of kernel runtime, achieving up to 18% of processor energy savings on multimedia applications. Our technique requires minimal architectural modifications and incurs negligible ( < 1%) performance loss.

[1]  Massoud Pedram,et al.  Low power design methodologies , 1996 .

[2]  David J. Lilja,et al.  Data prefetch mechanisms , 2000, CSUR.

[3]  Massoud Pedram,et al.  Power Aware Design Methodologies , 2002 .

[4]  Luca Benini,et al.  A survey of design techniques for system-level dynamic power management , 2000, IEEE Trans. Very Large Scale Integr. Syst..

[5]  Margaret Martonosi,et al.  XTREM: a power simulator for the Intel XScale® core , 2004, LCTES '04.

[6]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[7]  Trevor Mudge,et al.  MiBench: A free, commercially representative embedded benchmark suite , 2001 .

[8]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[9]  Lawrence T. Clark,et al.  An embedded 32-b microprocessor core for low-power and high-performance applications , 2001 .