Address-value delta (AVD) prediction: increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns

While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel technique, address-value delta (AVD) prediction. An AVD predictor tracks address (pointer) load instructions, i.e., loads whose data value is itself an address, for which the arithmetic difference (i.e., delta) between the effective address and the data value is stable. If such a load instruction incurs a long-latency cache miss during runahead execution, its data value is predicted by subtracting the stable delta from its effective address. This prediction enables the pre-execution of dependent instructions, including load instructions that incur long-latency cache misses. We describe how, why, and for what kind of loads AVD prediction works and evaluate the design tradeoffs in an implementable AVD predictor. Our analysis shows that stable AVDs exist because of patterns in the way data structures are allocated in memory. Our results show that augmenting a runahead processor with a simple, 16-entry AVD predictor reduces the average execution time of a set of pointer-intensive applications by 12.1%.
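The prediction rule described above (delta = effective address minus data value; predicted value = effective address minus stable delta) can be illustrated with a minimal software sketch. This is not the hardware design evaluated in the paper: the class name `AVDPredictor`, the direct-mapped indexing by load PC, the confidence threshold of 2, and the `max_avd` cutoff are all assumptions chosen for illustration. Only the 16-entry size and the subtraction rule come from the text.

```python
class AVDPredictor:
    """Illustrative model of an address-value delta (AVD) predictor.

    Assumptions (not taken from the abstract): direct-mapped table indexed
    by load PC, a small saturating confidence counter, and a max_avd cutoff
    that filters out deltas too large to reflect an allocation pattern.
    """

    def __init__(self, entries=16, conf_threshold=2, conf_max=3, max_avd=64 * 1024):
        self.entries = entries
        self.conf_threshold = conf_threshold
        self.conf_max = conf_max
        self.max_avd = max_avd
        self.table = {}  # index -> (pc_tag, avd, confidence)

    def _index(self, pc):
        return pc % self.entries

    def update(self, pc, address, value):
        """Train on a completed load: record delta = address - value."""
        avd = address - value
        idx = self._index(pc)
        entry = self.table.get(idx)
        if abs(avd) > self.max_avd:
            # Delta too large to be a plausible stable AVD; drop any entry for this load.
            if entry is not None and entry[0] == pc:
                del self.table[idx]
            return
        if entry is not None and entry[0] == pc and entry[1] == avd:
            # Same delta as last time: raise confidence (saturating).
            self.table[idx] = (pc, avd, min(entry[2] + 1, self.conf_max))
        else:
            # New load or changed delta: (re)allocate with zero confidence.
            self.table[idx] = (pc, avd, 0)

    def predict(self, pc, address):
        """Return the predicted data value for a missing load, or None."""
        entry = self.table.get(self._index(pc))
        if entry is None or entry[0] != pc or entry[2] < self.conf_threshold:
            return None
        return address - entry[1]  # value = effective address - stable delta


if __name__ == "__main__":
    # Hypothetical linked list: 32-byte nodes allocated back to back,
    # with the `next` pointer stored at offset 8 in each node.
    predictor = AVDPredictor()
    load_pc = 0x400
    for node in (0x1000, 0x1020, 0x1040):
        predictor.update(load_pc, address=node + 8, value=node + 32)
    # The load of node 0x1060's `next` field misses in the cache during runahead.
    print(hex(predictor.predict(load_pc, 0x1060 + 8)))
```

In this hypothetical layout every `next` load sees the same delta of -24 (address `node + 8`, value `node + 32`), so after three training updates the predictor supplies the address of the following node without waiting for the miss, which is what lets a dependent load be pre-executed.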
