Using value prediction to increase the power of speculative execution hardware

This article presents an experimental and analytical study of value prediction and its impact on speculative execution in superscalar microprocessors. Value prediction is a new paradigm that predicts the outcome values of operations at run time and uses these predicted values to trigger the speculative execution of true-data-dependent operations. As a result, stalls to memory locations can be reduced and the amount of instruction-level parallelism can be extended beyond the limits of the program's dataflow graph. This article examines the characteristics of the value prediction concept from two perspectives: (1) the related phenomena that are reflected in the nature of computer programs and (2) the significance of these phenomena to boosting instruction-level parallelism of superscalar microprocessors that support speculative execution. To better understand these characteristics, our work combines both analytical and experimental studies.
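
To make the idea of predicting outcome values concrete, the following is a minimal sketch of one simple scheme, a last-value predictor, which guesses that an instruction will produce the same result it produced the previous time it executed. This is an illustrative assumption, not necessarily the predictor studied in the article; the names `LastValuePredictor` and `prediction_accuracy` and the trace format of (instruction address, result) pairs are hypothetical.

```python
class LastValuePredictor:
    """Toy last-value predictor (illustrative assumption, not the article's design)."""

    def __init__(self):
        # Maps an instruction address to the last result that instruction produced.
        self.table = {}

    def predict(self, pc):
        # Returns None when the instruction has no recorded history yet.
        return self.table.get(pc)

    def update(self, pc, actual):
        # Record the actual result once the instruction has executed.
        self.table[pc] = actual


def prediction_accuracy(trace):
    """Fraction of results in `trace` that the predictor guessed correctly.

    `trace` is a list of (instruction_address, result) pairs in execution order.
    """
    predictor = LastValuePredictor()
    correct = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            # A correct guess would let dependent operations start early.
            correct += 1
        predictor.update(pc, value)
    return correct / len(trace) if trace else 0.0
```

In a real processor, a correct prediction lets true-data-dependent operations begin before the producing operation completes, while a misprediction requires the speculatively executed work to be discarded and redone; this sketch only measures how often the guess would have been right.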
