Linearizing irregular memory accesses for improved correlated prefetching

This paper introduces the Irregular Stream Buffer (ISB), a prefetcher that targets irregular sequences of temporally correlated memory references. The key idea is to use an extra level of indirection to translate arbitrary pairs of correlated physical addresses into consecutive addresses in a new structural address space, which is visible only to the ISB. This structural address space allows the ISB to organize prefetching meta-data so that it is simultaneously temporally and spatially ordered, which yields benefits in coverage, accuracy, and memory traffic overhead. We evaluate the ISB using the MARSS full-system simulator and the irregular, memory-intensive programs of SPEC CPU 2006 on both single-core and multi-core systems. For example, on a single core, the ISB exhibits an average speedup of 23.1% with 93.7% accuracy, compared to 9.9% speedup and 64.2% accuracy for an idealized prefetcher that over-approximates the STMS prefetcher, the previous best temporal stream prefetcher; this ISB configuration uses 32 KB of on-chip storage and incurs 8.4% memory traffic overhead due to meta-data accesses. We also show that a hybrid prefetcher that combines a stride prefetcher and an ISB with just 8 KB of on-chip storage exhibits 40.8% speedup and 66.2% accuracy.
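
The indirection described above can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the hardware design evaluated in the paper: the class name `ToyStructuralMapper`, the `stream_gap` parameter, the `degree` argument, and the rule used to allocate structural addresses are all hypothetical, and the sketch ignores on-chip storage limits and meta-data traffic. It shows only how giving correlated physical addresses consecutive structural addresses turns "predict the next irregular access" into "read the next structural address".

```python
class ToyStructuralMapper:
    """Toy model (illustrative assumption, not the paper's design):
    physical addresses seen back to back get consecutive structural addresses."""

    def __init__(self, stream_gap=1024):
        self.phys_to_struct = {}      # physical address -> structural address
        self.struct_to_phys = {}      # structural address -> physical address
        self.next_stream_base = 0     # where the next new stream starts
        self.stream_gap = stream_gap  # hypothetical spacing between streams
        self.prev = None              # previously accessed physical address

    def _bind(self, phys, struct):
        self.phys_to_struct[phys] = struct
        self.struct_to_phys[struct] = phys

    def access(self, phys, degree=2):
        """Record one access; return physical addresses to prefetch."""
        # Predict: walk forward in the structural address space.
        prefetches = []
        s = self.phys_to_struct.get(phys)
        if s is not None:
            for k in range(1, degree + 1):
                nxt = self.struct_to_phys.get(s + k)
                if nxt is None:
                    break
                prefetches.append(nxt)

        # Train: place an unmapped address right after its predecessor,
        # or start a new stream if that slot is unavailable.
        if phys not in self.phys_to_struct:
            prev_s = self.phys_to_struct.get(self.prev)
            if prev_s is not None and (prev_s + 1) not in self.struct_to_phys:
                self._bind(phys, prev_s + 1)
            else:
                self._bind(phys, self.next_stream_base)
                self.next_stream_base += self.stream_gap
        self.prev = phys
        return prefetches


# An irregular (pointer-chasing style) sequence of physical addresses.
trace = [0x7000, 0x1040, 0x9F80, 0x2200]
mapper = ToyStructuralMapper()
for addr in trace:               # first pass: training only, nothing to predict
    mapper.access(addr)
print([hex(p) for p in mapper.access(0x7000)])  # ['0x1040', '0x9f80']
```

After one pass over the trace, the four unrelated physical addresses occupy structural addresses 0 through 3, so a repeat access to the first one yields its two successors with simple sequential lookups.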
