The Ultrascalar processor: an asymptotically scalable superscalar microarchitecture

The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as $\Theta(n^2)$, where $n$ is the fetch width, the issue width, or the window size. This paper presents a novel implementation, called the Ultrascalar processor, that dramatically reduces the asymptotic critical-path length of a superscalar processor. The processor is implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors, including renaming, out-of-order execution, and speculative execution. The Ultrascalar's critical-path length due to gate delays is $\tau_{\text{gates}} = \Theta(\log n)$. The wire delays and chip size depend on the provided memory bandwidth and the layout. If the provided memory bandwidth is $M(n)$ memory operations per clock cycle, then, using an H-tree VLSI layout, the critical-path length due to wire delay (speed-of-light delay) is

$$
\tau_{\text{wires}} =
\begin{cases}
\Theta(n^{1/2}) & \text{if } M(n) \text{ is } O(n^{1/2-\varepsilon}) \text{ for } \varepsilon > 0 \quad \text{[optimal]}, \\
\Theta(n^{1/2} \log n) & \text{if } M(n) \text{ is } \Theta(n^{1/2}) \quad \text{[near optimal]}, \\
\Theta(M(n)) & \text{if } M(n) \text{ is } \Omega(n^{1/2+\varepsilon}) \text{ for } \varepsilon > 0 \quad \text{[optimal]}
\end{cases}
$$

(with $M$ suitably constrained). The area is the square of the wire delay.
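To make the role of the parallel-prefix networks more concrete, the following is a minimal software sketch of how a prefix computation could deliver, to each execution station, the most recent value written to one register by an earlier station. It is an illustration under stated assumptions, not the paper's circuit: the names `combine`, `prefix_scan`, and `incoming_values` are hypothetical, only a single register is modeled, and the scan uses recursive doubling rather than a tree layout. What it shares with the abstract's claim is that the number of combining rounds grows logarithmically with the number of stations.

```python
from typing import List, Optional, Tuple

# One item per position: (writes, value). `writes` is True when that
# position supplies a new value for the register being tracked.
Item = Tuple[bool, Optional[int]]


def combine(older: Item, newer: Item) -> Item:
    """Associative operator: a newer write wins, otherwise the older value passes through."""
    return newer if newer[0] else older


def prefix_scan(items: List[Item]) -> Tuple[List[Item], int]:
    """Inclusive scan by recursive doubling.

    Returns the scanned list and the number of combining rounds,
    which is ceil(log2(len(items))) for non-empty input.
    """
    result = list(items)
    rounds = 0
    stride = 1
    while stride < len(result):
        prev = result
        result = [
            combine(prev[i - stride], prev[i]) if i >= stride else prev[i]
            for i in range(len(prev))
        ]
        stride *= 2
        rounds += 1
    return result, rounds


def incoming_values(initial: int, writes: List[Optional[int]]) -> Tuple[List[int], int]:
    """Value of the register as seen by each station (assumed model).

    `writes[i]` is the value station i writes, or None if it leaves the
    register untouched. Station i sees the latest write from stations
    0..i-1, or `initial` if none of them wrote.
    """
    # Position 0 holds the initial register value; station k sits at position k + 1.
    items: List[Item] = [(True, initial)] + [(w is not None, w) for w in writes]
    scanned, rounds = prefix_scan(items)
    # The inclusive prefix at position i covers the initial value and stations 0..i-1.
    return [scanned[i][1] for i in range(len(writes))], rounds


if __name__ == "__main__":
    values, rounds = incoming_values(7, [None, 10, None, None, 20, None, None, None])
    print(values)  # [7, 7, 10, 10, 10, 20, 20, 20]
    print(rounds)  # 4
```

In this model, eight stations plus the initial value are resolved in four combining rounds, whereas a station-by-station chain would need eight sequential steps. The sketch says nothing about wire delay, which the abstract treats separately.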
