Instruction Window Size Trade-Offs and Characterization of Program Parallelism

Detecting independent operations is a prime objective for computers that can issue and execute multiple operations simultaneously. The scope of concurrency detection is the number of instructions that are examined simultaneously to find those that are independent. The authors present an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as the number of pipelines in a superscalar architecture. The model can show where a performance bottleneck lies: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative execution is examined via a set of probability distributions that characterize the inherent parallelism in the instruction stream. These results were derived using traces from the Multiflow TRACE SCHEDULING compacting FORTRAN 77 and C compilers. The experiments provide misprediction delay estimates for 11 common application-level benchmarks under scope constraints, assuming speculative, out-of-order execution and run-time scheduling. The throughput prediction of the analytical model is shown to be close to the measured static throughput of the compiler output.
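
The three bottlenecks named in the abstract can be illustrated with a small simulation. The sketch below is not the authors' analytical model; it is a toy issue loop written for illustration, and every name in it (`simulate_throughput`, `trace`, `window`, `pipelines`) is hypothetical. It assumes unit-latency operations, no branches, and dependences that point only to earlier instructions.

```python
def simulate_throughput(deps, window, pipelines):
    """Return instructions per cycle for a toy window-limited issue model.

    deps[i] is the set of earlier instruction indices that instruction i
    depends on. Each cycle, only the `window` oldest unissued instructions
    are examined (the scope of concurrency detection), and at most
    `pipelines` of them whose dependences finished in an earlier cycle
    are issued. Assumes unit latency (an illustrative simplification).
    """
    completed = set()               # instructions issued in earlier cycles
    pending = list(range(len(deps)))  # unissued instructions, program order
    cycles = 0
    while pending:
        cycles += 1
        issued = []
        for i in pending[:window]:
            if len(issued) == pipelines:
                break               # all pipelines are busy this cycle
            if deps[i] <= completed:
                issued.append(i)    # every dependence already completed
        completed.update(issued)
        pending = [i for i in pending if i not in completed]
    return len(deps) / cycles


# Two independent four-instruction dependence chains, laid out one after
# the other in program order: 0 -> 1 -> 2 -> 3 and 4 -> 5 -> 6 -> 7.
trace = [set(), {0}, {1}, {2}, set(), {4}, {5}, {6}]

for window, pipelines in [(1, 2), (4, 2), (8, 2), (8, 1), (8, 4)]:
    ipc = simulate_throughput(trace, window, pipelines)
    print(f"window={window} pipelines={pipelines} throughput={ipc}")
```

On this trace, a window of 8 with one pipeline is limited by resources, a window of 4 with two pipelines is limited by scope (the second chain is not visible in the first cycle), and a window of 8 with four pipelines is limited by the parallelism of the instruction stream itself, since only two chains exist.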
