Enhancing memory-level parallelism via recovery-free value prediction

The ever-increasing computational power of contemporary microprocessors reduces the execution time spent on arithmetic computations (i.e., the computations not involving slow memory operations such as cache misses) significantly. Therefore, for memory-intensive workloads, it becomes more important to overlap multiple cache misses than to overlap slow memory operations with other computations. In this paper, we propose a novel technique to parallelize sequential cache misses, thereby increasing memory-level parallelism (MLP). Our idea is based on value prediction, which was proposed originally as an instruction-level parallelism (ILP) optimization to break true data dependencies. In this paper, we advocate value prediction in its capability to enhance MLP instead of ILP. We propose using value prediction and value-speculative execution only for prefetching so that not only the complex prediction validation and misprediction recovery mechanisms are avoided, but better performance can also be achieved for memory-intensive workloads. The minor hardware modifications that are required also enable aggressive memory disambiguation for prefetching. The experimental results show that our technique enhances MLP effectively and achieves significant speedups, even with a simple stride value predictor.

[1]  Kenneth C. Yeager The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.

[2]  James E. Smith,et al.  The predictability of data values , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[3]  Eric Rotenberg,et al.  A large, fast instruction window for tolerating cache misses , 2002, ISCA.

[4]  Uri C. Weiser,et al.  Correlated load-address predictors , 1999, ISCA.

[5]  Dirk Grunwald,et al.  A stateless, content-directed data prefetching mechanism , 2002, ASPLOS X.

[6]  John L. Henning SPEC CPU2000: Measuring CPU Performance in the New Millennium , 2000, Computer.

[7]  Huiyang Zhou,et al.  Performance Modeling of Memory Latency Hiding Techniques , 2002 .

[8]  John Paul Shen,et al.  Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[9]  Youfeng Wu,et al.  Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching , 2002, PLDI '02.

[10]  Kai Wang,et al.  Highly accurate data value prediction using hybrid predictors , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  Eric Sprangle,et al.  Increasing processor performance by implementing deeper pipelines , 2002, ISCA.

[12]  Trevor N. Mudge,et al.  Author retrospective improving data cache performance by pre-executing instructions under a cache miss , 1997, International Conference on Supercomputing.

[13]  Eric Rotenberg,et al.  Trace cache: a low latency approach to high bandwidth instruction fetching , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[14]  Mikko H. Lipasti,et al.  Exceeding the dataflow limit via value prediction , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[15]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[16]  José González,et al.  Control-flow speculation through value prediction for superscalar processors , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[17]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[18]  F. Gabbay Speculative Execution based on Value Prediction Research Proposal towards the Degree of Doctor of Sciences , 1996 .

[19]  Rajiv Gupta,et al.  Predictability of load/store instruction latencies , 1993, Proceedings of the 26th Annual International Symposium on Microarchitecture.

[20]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[21]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[22]  Gurindar S. Sohi,et al.  Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[23]  James E. Smith,et al.  Improving branch predictors by correlating on data values , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[24]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[25]  José González,et al.  Speculative execution via address prediction and data prefetching , 1997, ICS '97.

[26]  Huiyang Zhou,et al.  Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction , 2005, IEEE Trans. Computers.

[27]  Martin C. Carlisle,et al.  Olden: parallelizing programs with dynamic data structures on distributed-memory machines , 1996 .

[28]  Chi-Keung Luk,et al.  Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[29]  Huiyang Zhou,et al.  Detecting global stride locality in value streams , 2003, ISCA '03.

[30]  Christopher Hughes,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.

[31]  Pen-Chung Yew,et al.  On some implementation issues for value prediction on wide-issue ILP processors , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[32]  Glenn Reinman,et al.  A Comparative Survey of Load Speculation Architectures , 2000, J. Instr. Level Parallelism.