Value locality and speculative execution

pipelined memory subsystem that allows any number of nonblocking misses (this is more aggressive than the actual 21164 and will tend to understate the benefits of value prediction-based speculation). Second, to allow value prediction-based speculation to occur, we must verify predictions by comparing the predicted value to the actual value computed by the ALU. This comparison requires an extra stage before writeback. The third modification, the addition of the reissue buffer, allows us to buffer instruction dispatch groups that contain value-predicted instructions. With this feature, we are able to redispatch instructions when a misprediction occurs with only a single-cycle penalty.

Figure 2-1. PPC 620 and 620+ Block Diagram. Buffer sizes shown as (620/620+).
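The verification step and the reissue penalty described above can be illustrated with a small sketch. This is not the authors' simulator. It only models the comparison of a predicted value against the value the ALU actually computes, and charges one cycle per misprediction, as the text states for redispatch from the reissue buffer. The names `Outcome`, `verify`, `run`, and `REISSUE_PENALTY`, and the trace format of (predicted, actual) pairs, are assumptions made for illustration.

```python
from dataclasses import dataclass

# Single-cycle redispatch penalty, as stated in the text.
# The constant name is hypothetical.
REISSUE_PENALTY = 1


@dataclass
class Outcome:
    correct: bool        # did the predicted value match the computed value?
    penalty_cycles: int  # cycles lost to redispatch (0 if correct)


def verify(predicted: int, actual: int) -> Outcome:
    """Model the extra compare stage before writeback."""
    if predicted == actual:
        return Outcome(correct=True, penalty_cycles=0)
    # Mismatch: dependent instructions are redispatched from the reissue buffer.
    return Outcome(correct=False, penalty_cycles=REISSUE_PENALTY)


def run(trace):
    """Count mispredictions and total penalty over (predicted, actual) pairs."""
    mispredictions = 0
    penalty = 0
    for predicted, actual in trace:
        outcome = verify(predicted, actual)
        if not outcome.correct:
            mispredictions += 1
            penalty += outcome.penalty_cycles
    return mispredictions, penalty
```

For example, `run([(1, 1), (2, 3)])` counts one misprediction and one penalty cycle. A real pipeline model would also track which dispatch groups are held in the reissue buffer; that detail is omitted here.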
