Increased scalability and power efficiency by using multiple speed pipelines

One of the most important problems faced by microarchitecture designers is the poor scalability of some current solutions with increased clock frequencies and wider pipelines. As several studies show, internal processor structures scale differently with decreasing device sizes. While in some cases the access latency is determined by the speed of the logic circuitry, in others it is dominated by the interconnect delay. Furthermore, while some stages can be super-pipelined with relatively small performance loss, others must be kept atomic. This paper proposes a possible solution to this problem that avoids the traditional trade-off between parallelism and clock speed. First, allowing instructions to enter and leave the Issue Window asynchronously enables faster speeds in the front-end at the expense of small synchronization latencies. Second, using an Execution Cache to store instructions that are already scheduled allows the issue circuitry to be bypassed and the execution core to be clocked at higher frequencies. Combined, these two mechanisms result in a 50% to 60% performance increase for our test microarchitecture, without requiring a completely new scheduling mechanism. Furthermore, the proposed microarchitecture requires significantly less energy than the original baseline, with a 30% reduction in a 0.13 µm process technology or 20% in a 0.06 µm process technology.
