An Evaluation of Deeply Decoupled Cores

The trend towards larger structures and aggressive clock frequencies has been a fundamental driving force in modern microprocessor design. One approach is to deeply pipeline any high-delay structure, but dependencies and critical loops have made it increasingly difficult to speed up execution through extensive pipelining. An alternative is to remove larger structures from the critical path. We explore the ramifications of stripping all but the most necessary functionality out of the processing core, leaving only a tiny μ-core. Although past studies have shown that some individual helper structures can be decoupled, the impact of streamlining all of these structures at the same time has not been explored. Along with describing the challenges in decoupling the helper engines, we focus on the performance, power consumption, and thermal behavior of the μ-core architecture. Our analysis uses detailed performance, power, and thermal models. Our results indicate that the μ-core provides a 20% reduction in power over a conventional monolithic core while providing comparable performance (a 1% improvement on average). By dynamically configuring the helper engines for different application phases, an additional 13% power savings can be attained with an average performance degradation of only 3%. Our experi
