Paper information - Circuits for wide-window superscalar processors

Circuits for wide-window superscalar processors

Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can run programs 10-60% faster (geometric mean of 31%) than today's processors. The processor operates at clock speeds comparable to today's processors but achieves significantly higher instruction-level parallelism (ILP). To measure the impact of a large window on clock speed, we design and simulate new implementations of the logic components that most limit the critical path of our large-window processor: the schedule logic and the wake-up logic. We reimplement these components using log-depth cyclic segmented prefix (CSP) circuits. Our layouts and simulations of critical paths through these circuits indicate that our large-window processor could be clocked at frequencies exceeding 500 MHz in today's technology. Our commit logic and rename logic can also run at these speeds. To measure the impact of a large window on ILP, we compare two microarchitectures. The first has a 128-instruction window, an 8-wide fetch unit, and 20-wide issue (four each of integer, branch, multiply, float, and memory units). The second has a 32-instruction window and a 4-wide fetch unit, and is comparable to today's processors. For each, we simulate different window reuse and bypass policies. Our simulations show that the large-window processor achieves significantly higher instructions per cycle (IPC). This performance increase comes even though the large-window processor uses a wrap-around window while the small-window processor uses a compressing window, which effectively increases the number of outstanding instructions the small-window processor can hold. Furthermore, the large-window processor sometimes pays an extra clock cycle for bypassing.
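
The abstract names cyclic segmented prefix (CSP) circuits without defining them, so the following is a minimal software sketch of what such an operation might compute, under stated assumptions. It assumes an inclusive variant: each output combines the inputs from the nearest segment start (searching backwards and wrapping around the end of the array) up to and including its own position. The function name `cyclic_segmented_prefix` and its parameters are hypothetical, and the loop is a sequential reference model, not the log-depth circuit the paper describes.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def cyclic_segmented_prefix(
    values: Sequence[T],
    segment_start: Sequence[bool],
    op: Callable[[T, T], T],
) -> List[T]:
    """Sequential reference model of an inclusive cyclic segmented prefix.

    out[i] = values[k] op ... op values[i], where k is the nearest position
    at or before i (wrapping around cyclically) with segment_start[k] set.
    Assumption: at least one segment start exists, otherwise the result
    is undefined and an error is raised.
    """
    n = len(values)
    if n != len(segment_start):
        raise ValueError("values and segment_start must have equal length")
    if not any(segment_start):
        raise ValueError("at least one segment start is required")

    # Begin the walk at a segment start so every position sees its own
    # segment's accumulated value, including positions reached by wrapping.
    first = next(i for i, s in enumerate(segment_start) if s)

    out: List[Optional[T]] = [None] * n
    acc: Optional[T] = None
    for step in range(n):
        i = (first + step) % n
        if segment_start[i]:
            acc = values[i]           # a segment start resets the accumulation
        else:
            acc = op(acc, values[i])  # otherwise extend the running result
        out[i] = acc
    return out


# Example: addition over a 6-entry cyclic window with segments starting at
# positions 2 and 5. Positions 0 and 1 belong to the segment that starts at 5.
sums = cyclic_segmented_prefix(
    [1, 2, 3, 4, 5, 6],
    [False, False, True, False, False, True],
    lambda a, b: a + b,
)
print(sums)  # [7, 9, 3, 7, 12, 6]
```

As an illustration only, in a wrap-around instruction window the segment start could mark the oldest instruction, and a logical AND as the operator would tell each entry whether it and every older entry satisfy some condition. A hardware version would evaluate the same function with a tree of operators whose depth grows logarithmically with the window size, which is the property the abstract relies on for clock speed.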
