Memory-Side Prefetching for Linked Data Structures

This paper studies a memory-side prefetching technique that hides the latency of inherently serial accesses to linked data structures (LDS). A programmable prefetch engine sits close to memory and traverses LDS independently of the processor. Because it has a low-latency, high-bandwidth path to memory, the prefetch engine can run ahead of the processor, initiating data transfers earlier than the processor could and pipelining multiple such transfers over the network. We evaluate the proposed scheme on the pointer-intensive Olden benchmark suite, comparing it both to a system without any prefetching and to one with a state-of-the-art processor-side software prefetching scheme for LDS. For the six benchmarks where LDS memory stall time is significant, the memory-side scheme reduces execution time by an average of 27% (range of 0% to 62%) compared to a system without any prefetching. Compared to processor-side prefetching, the memory-side scheme reduces execution time by 20% to 50% for three of the six applications, performs about the same for two, and is 18% worse for one. We conclude that memory-side prefetching is effective but that a combination of processor- and memory-side prefetching is best, and we provide a qualitative framework for determining when each scheme should be used. Our results differ significantly from those of a previous memory-side prefetching study, primarily because we compare against a state-of-the-art processor-side scheme.
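To illustrate why pointer chasing near memory can run ahead of the processor, the following is a minimal back-of-the-envelope sketch, not the paper's simulation model. The function names and all latency values are hypothetical, and the model assumes every node misses in the cache, ignores processor computation and bandwidth limits, and treats the pushed transfers as perfectly pipelined.

```python
def processor_side_time(n_nodes, round_trip):
    """Cycles for the processor to walk a list with no prefetching.

    Each node's address is stored in the previous node, so every miss
    must complete a full round trip before the next one can be issued.
    """
    return n_nodes * round_trip


def memory_side_time(n_nodes, local_access, one_way):
    """Cycles until the last node reaches the processor when an engine
    next to memory walks the list and pushes each node out.

    One network traversal starts the engine, the engine then chases
    pointers at local memory latency, and the final push needs one more
    traversal. Earlier pushes overlap with later local accesses.
    """
    return one_way + n_nodes * local_access + one_way


if __name__ == "__main__":
    # Hypothetical latencies in cycles, chosen only for illustration.
    nodes = 100
    print(processor_side_time(nodes, round_trip=100))
    print(memory_side_time(nodes, local_access=20, one_way=40))
```

Under these assumed numbers the serialized round trips dominate the processor-side cost, while the memory-side cost is dominated by local accesses. The gap shrinks as the local access latency approaches the round-trip latency.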
