Circuits for wide-window superscalar processors

Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometric mean of 31%) in program speed compared to today's processors. The processor operates at clock speeds comparable to those of today's processors but achieves significantly higher ILP. To measure the impact of a large window on clock speed, we design and simulate new implementations of the logic components that most limit the critical path of our large-window processor: the schedule logic and the wake-up logic. We reimplement these components using log-depth cyclic segmented prefix (CSP) circuits. Our layouts and simulations of critical paths through these circuits indicate that our large-window processor could be clocked at frequencies exceeding 500 MHz in today's technology. Our commit logic and rename logic can also run at these speeds. To measure the impact of a large window on ILP, we compare two microarchitectures. The first has a 128-instruction window, an 8-wide fetch unit, and 20-wide issue (four each of integer, branch, multiply, float, and memory units). The second has a 32-instruction window and a 4-wide fetch unit, and is comparable to today's processors. For each, we simulate different window reuse and bypass policies. Our simulations show that the large-window processor achieves significantly higher IPC. This performance increase comes even though the large-window processor uses a wrap-around window while the small-window processor uses a compressing window, which effectively increases the small-window processor's number of outstanding instructions. Furthermore, the large-window processor sometimes pays an extra clock cycle for bypassing.
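
To make the phrase "log-depth cyclic segmented prefix" concrete, the following is a minimal software sketch of what such a circuit computes; it is not the authors' circuit or layout. It assumes an inclusive definition: the positions form a ring, a segment bit marks where a segment starts (for example, the oldest instruction in a wrap-around window), and each position receives the combination of all values from the nearest segment start up to itself. The function names `csp_reference` and `csp_log_depth` are hypothetical, and the doubling scheme used for the second function is only one standard way to reach logarithmic depth, chosen here for brevity.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def csp_reference(values: Sequence[T], seg: Sequence[bool],
                  op: Callable[[T, T], T]) -> List[T]:
    """Linear-time reference: walk the ring back to the nearest segment start."""
    n = len(values)
    if not any(seg):
        raise ValueError("at least one segment bit must be set")
    out = []
    for i in range(n):
        j = i
        while not seg[j]:          # find the segment start at or before i
            j = (j - 1) % n
        acc = values[j]
        while j != i:              # combine forward, wrapping around the ring
            j = (j + 1) % n
            acc = op(acc, values[j])
        out.append(acc)
    return out


def csp_log_depth(values: Sequence[T], seg: Sequence[bool],
                  op: Callable[[T, T], T]) -> List[T]:
    """Same result in ceil(log2(n)) rounds; each round models one logic layer.

    Assumes op is associative and at least one segment bit is set.
    """
    n = len(values)
    if not any(seg):
        raise ValueError("at least one segment bit must be set")
    vals = list(values)
    flags = [bool(s) for s in seg]
    d = 1
    while d < n:
        new_vals, new_flags = [], []
        for i in range(n):
            j = (i - d) % n        # neighbor d positions earlier on the ring
            if flags[i]:
                # A segment start already lies inside this position's span.
                new_vals.append(vals[i])
            else:
                new_vals.append(op(vals[j], vals[i]))
            new_flags.append(flags[i] or flags[j])
        vals, flags = new_vals, new_flags
        d *= 2
    return vals


if __name__ == "__main__":
    import operator
    # Ring of 8 slots; the segment (oldest entry) starts at slot 5.
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    start = [False] * 5 + [True] + [False] * 2
    print(csp_log_depth(data, start, operator.add))
```

With addition as the operator the result is a running sum that begins at the marked slot and wraps around the ring. With a Boolean operator such as OR or AND, the same structure can answer questions of the form "has any earlier instruction in the window done X", which is the kind of query that wake-up and schedule logic needs.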
