The performance potential of fine-grain and coarse-grain parallel architectures

Recent work has shown that pipelining and multiple instruction issue are architecturally equivalent in their ability to exploit parallelism, but there has been little work directly comparing the performance of these fine-grain parallel architectures with that of coarse-grain multiprocessors. Using trace-driven simulations, the authors compare the performance of a superscalar processor and a pipelined processor, each using dynamic dependence checking, with that of a shared-memory multiprocessor. For highly parallel programs, they find that the fine-grain processors must bypass an unrealistically large number of branches to match the performance of the multiprocessor. When executing programs with a wide range of potential parallelism, the best performance is obtained with a multiprocessor in which each individual processor has a fine-grain parallelism of two to four.
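One way to get intuition for the trade-off the abstract reports is an Amdahl-style back-of-the-envelope model. The sketch below is an illustrative assumption, not the paper's trace-driven simulation methodology: it assumes the serial fraction of a program can only exploit a small amount of instruction-level parallelism (the `serial_ilp_limit` parameter, a hypothetical name), while the parallel fraction benefits from both the processor count and the per-processor issue width.

```python
def speedup(parallel_frac, n_procs, issue_width, serial_ilp_limit=2.0):
    """Amdahl-style speedup estimate for a multiprocessor whose
    individual processors each issue `issue_width` instructions per cycle.

    Illustrative model only: the serial portion of the program runs on
    one processor and can exploit fine-grain parallelism only up to
    `serial_ilp_limit`; the parallel portion scales with the total
    issue slots, n_procs * issue_width.
    """
    # Time spent in the serial portion, sped up by limited ILP.
    serial_time = (1.0 - parallel_frac) / min(issue_width, serial_ilp_limit)
    # Time spent in the parallel portion, spread over all issue slots.
    parallel_time = parallel_frac / (n_procs * issue_width)
    return 1.0 / (serial_time + parallel_time)

# With the same total issue bandwidth (8 slots), giving each processor
# a modest issue width helps the serial fraction, echoing the paper's
# finding that fine-grain parallelism of two to four per processor
# works well for programs of mixed parallelism.
print(speedup(0.9, 8, 1))   # 8 scalar processors
print(speedup(0.9, 4, 2))   # 4 two-issue processors
```

Under this toy model, the four two-issue processors beat eight scalar ones because the serial 10% of the work no longer dominates; the actual paper establishes this with trace-driven simulation rather than an analytic formula.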
