Scaling Power and Performance via Processor Composability

Power dissipation trends are leading high-performance processors to a regime in which not all chip elements can be operated simultaneously at maximum frequency. Consequently, energy efficiency will become even more important, and performance must be achieved within strict power budgets. Current designs employ techniques such as dynamic voltage and frequency scaling (DVFS) to provide power-performance tradeoffs for both single-threaded and multi-threaded workloads. In power-dominated regimes, processors will be run at or near the minimum voltage. Frequency can be reduced to save power, but there is no scaling strategy for increasing performance with high energy efficiency once the processor is operating at its maximum frequency (and minimum voltage). In this paper, we evaluate the energy efficiency of processor composability (dynamically aggregating small, energy-efficient physical cores into larger logical processors) as a method of scaling single-threaded performance up and down, and compare it to the energy efficiency of voltage and frequency scaling. We measure the power breakdown of the baseline composable microarchitecture (the TFlex microarchitecture, based on an EDGE ISA) and compare its energy efficiency and performance to those of one processor designed for power efficiency (XScale) and another designed for high performance (a variant of the Power-4), using normalized power models for as fair a comparison as possible. The study shows that composing multiple dual-issue cores (up to eight) provides performance scaling that is as energy-efficient as frequency scaling in a balanced microarchitecture, and is considerably more efficient than scaling the voltage to achieve additional performance once the maximum frequency at the minimum voltage is attained.
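The contrast between frequency scaling and voltage scaling can be illustrated with a first-order sketch. It uses the textbook CMOS dynamic-power relation P = C·V²·f, ignores leakage, and assumes that the attainable frequency grows linearly with voltage; none of this is the normalized power model used in the paper. The constants `C_EFF`, `V_MIN`, `F_MAX_AT_VMIN`, and `V_BOOST` are hypothetical values chosen only for illustration.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """First-order CMOS dynamic power, P = C * V^2 * f (leakage ignored)."""
    return c_eff * voltage ** 2 * freq_hz


def energy_per_op(c_eff, voltage, freq_hz, ops_per_cycle=1.0):
    """Energy per operation = power divided by throughput (ops per second)."""
    return dynamic_power(c_eff, voltage, freq_hz) / (freq_hz * ops_per_cycle)


# Hypothetical, illustrative constants (not taken from the paper).
C_EFF = 1e-9           # effective switched capacitance, farads
V_MIN = 0.8            # minimum operating voltage, volts
F_MAX_AT_VMIN = 1e9    # maximum frequency attainable at V_MIN, hertz
V_BOOST = 1.0          # raised voltage used to exceed F_MAX_AT_VMIN, volts

# Baseline: maximum frequency at minimum voltage.
base = energy_per_op(C_EFF, V_MIN, F_MAX_AT_VMIN)

# Frequency-only scaling at fixed voltage: power drops with f,
# but energy per operation is unchanged in this model.
half_freq = energy_per_op(C_EFF, V_MIN, F_MAX_AT_VMIN / 2)

# Voltage scaling beyond the baseline: assume frequency scales linearly
# with voltage (a simplification), so energy per operation grows as V^2.
f_boost = F_MAX_AT_VMIN * V_BOOST / V_MIN
boosted = energy_per_op(C_EFF, V_BOOST, f_boost)

print("speedup from voltage boost:", f_boost / F_MAX_AT_VMIN)
print("energy/op ratio, half frequency vs. baseline:", half_freq / base)
print("energy/op ratio, voltage boost vs. baseline:", boosted / base)
```

Under these assumptions, lowering frequency at fixed voltage leaves energy per operation unchanged, while raising voltage to gain frequency increases it quadratically. Composability is not modeled here, because its efficiency depends on measured microarchitectural power breakdowns that a closed-form sketch cannot supply.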
