A Comparison of Asymptotically Scalable Superscalar Processors

Abstract. The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as $\Theta(n^2)$, where $n$ is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, Ultrascalar I and Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area). Both processors are implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors, including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both architectures use a parallel-prefix tree to communicate the register values between the execution stations. Ultrascalar I transmits an entire copy of the register file to each station, and the station chooses which register values it needs based on the instruction; it uses an H-tree layout. Ultrascalar II uses a mesh-of-trees and carefully sends only the register values that will actually be needed by each subtree, reducing the number of wires required on the chip. The complexity results are as follows. The complexity is described for a processor that has an instruction-set architecture containing $L$ logical registers and can execute $n$ instructions in parallel. The chip provides enough memory bandwidth to execute up to $M(n)$ memory operations per cycle. ($M$ is assumed to have a certain regularity property.) In all the processors, the VLSI area grows as the square of the wire delay.
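The parallel-prefix mechanism the abstract describes can be sketched abstractly in software. The following Python sketch is an illustration, not the paper's circuit (all names are hypothetical): each execution station's register writes are treated as a partial register-file update, and an associative "overwrite" operator combined in a log-depth tree gives every station the register values produced by all earlier stations.

```python
# Illustrative sketch of parallel-prefix propagation of register values.
# Each station produces a partial register-file update (the registers it
# writes); combining these updates with an associative "overwrite"
# operator in a log-depth tree gives each station its incoming registers.

def combine(a, b):
    """Associative operator: apply update b after update a."""
    merged = dict(a)
    merged.update(b)
    return merged

def prefix_tree(updates):
    """Exclusive prefix scan via a recursive tree (depth O(log n)).
    Returns (prefixes, total): prefixes[i] is the combined update of
    stations 0..i-1, i.e. the register values visible to station i."""
    if len(updates) == 1:
        return [{}], updates[0]
    mid = len(updates) // 2
    left_pre, left_tot = prefix_tree(updates[:mid])
    right_pre, right_tot = prefix_tree(updates[mid:])
    prefixes = left_pre + [combine(left_tot, p) for p in right_pre]
    return prefixes, combine(left_tot, right_tot)

# Four stations: writes to r1, r2, r1 again, and a station writing nothing.
stations = [{"r1": 5}, {"r2": 7}, {"r1": 9}, {}]
prefixes, final = prefix_tree(stations)
# prefixes -> [{}, {'r1': 5}, {'r1': 5, 'r2': 7}, {'r1': 9, 'r2': 7}]
```

Because the overwrite operator is associative, the scan can be evaluated by a tree of combiners in logarithmic depth, which is what gives both architectures their $O(\log)$ gate-delay bounds.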
Ultrascalar I has gate delay $O(\log n)$ and wire delay $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,L)$ if $M(n)$ is $O(n^{1/2-\varepsilon})$, $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,(L+\log n))$ if $M(n)$ is $\Theta(n^{1/2})$, and $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,L + M(n))$ if $M(n)$ is $\Omega(n^{1/2+\varepsilon})$, for $\varepsilon > 0$. Ultrascalar II has gate delay $\Theta(\log L + \log n)$. Its wire delay is $\Theta(n)$, which is optimal for $n = O(L)$. Thus Ultrascalar II dominates Ultrascalar I for $n = O(L^2)$; otherwise Ultrascalar I dominates Ultrascalar II. We introduce a hybrid ultrascalar that uses a two-level layout scheme: clusters of execution stations are laid out using the Ultrascalar II mesh-of-trees layout, and the clusters are then connected together using the H-tree layout of Ultrascalar I. For the hybrid (in which $n \geq L$), the wire delay is $\Theta(\sqrt{n}\,L + M(n))$, which is optimal. For $n \geq L$, the hybrid dominates both Ultrascalar I and Ultrascalar II. We also present an empirical comparison of Ultrascalar I and the hybrid, both laid out using the Magic VLSI editor. For a processor that has 32 32-bit registers and a simple integer ALU, the hybrid requires about 11 times less area.
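To make the crossover between the two layouts concrete, the following sketch plugs sample values into the leading terms of the wire-delay bounds above. This is illustrative only: constant factors are dropped, and $M(n) = \sqrt{n}$ memory ports is an assumption that puts Ultrascalar I in its middle regime.

```python
import math

L = 32  # logical registers, as in the paper's empirical comparison

# Leading terms of the quoted wire-delay bounds (constant factors dropped;
# M(n) = sqrt(n) is assumed).
def us1(n):     return math.sqrt(n) * (L + math.log2(n))  # Theta(sqrt(n)(L + log n))
def us2(n):     return n                                  # Theta(n)
def hybrid(n):  return math.sqrt(n) * L + math.sqrt(n)    # Theta(sqrt(n) L + M(n))

# Ultrascalar II wins below roughly n = L^2; Ultrascalar I wins above it.
for n in (64, L * L, 4 * L * L):
    print(f"n={n:5d}  US-I={us1(n):8.0f}  US-II={us2(n):8.0f}  hybrid={hybrid(n):8.0f}")
```

With these leading terms, $\Theta(n)$ beats $\Theta(\sqrt{n}\,L)$ whenever $n \leq L^2$, which is exactly the dominance boundary stated above.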
