The microarchitecture of superscalar processors

Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with a discussion of the general problem solved by superscalar processors: converting an ostensibly sequential program into a more parallel one. The principles underlying this process, and the constraints that must be met, are discussed. The paper then provides a description of the specific implementation techniques used in the important phases of superscalar processing. The major phases include: (1) instruction fetching and conditional branch processing, (2) the determination of data dependences involving register values, (3) the initiation, or issuing, of instructions for parallel execution, (4) the communication of data values through memory via loads and stores, and (5) committing the process state in correct order so that precise interrupts can be supported. Examples of recent superscalar microprocessors (the MIPS R10000, the DEC 21164, and the AMD K5) are used to illustrate a variety of superscalar methods.
