A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching

Pointer-chasing applications tend to traverse composite data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent pointer chains provides a source of memory parallelism. This article investigates exploiting such interchain memory parallelism for the purpose of memory latency tolerance, using a technique called multi-chain prefetching. Previous works [Roth et al. 1998; Roth and Sohi 1999] have proposed prefetching simple pointer-based structures in a multi-chain fashion. However, our work enables multi-chain prefetching for arbitrary data structures composed of lists, trees, and arrays.

This article makes five contributions in the context of multi-chain prefetching. First, we introduce a framework for compactly describing linked data structure (LDS) traversals, providing the data layout and traversal code work information necessary for prefetching. Second, we present an off-line scheduling algorithm for computing a prefetch schedule from the LDS descriptors that overlaps serialized cache misses across separate pointer-chain traversals. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Third, we propose a hardware prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to the computed prefetch schedule. Fourth, we present a compiler that extracts LDS descriptors via static analysis of the application source code, thus automating multi-chain prefetching. Finally, we conduct an experimental evaluation of compiler-instrumented multi-chain prefetching and compare it against jump pointer prefetching [Luk and Mowry 1996], prefetch arrays [Karlsson et al. 2000], and predictor-directed stream buffers (PSB) [Sherwood et al. 2000].

Our results show compiler-instrumented multi-chain prefetching improves execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite [Rogers et al. 1995], and by 3% across four SPECint2000 benchmarks. Compared to jump pointer prefetching and prefetch arrays, multi-chain prefetching achieves 34% and 11% higher performance for the selected Olden and SPECint2000 benchmarks, respectively. Compared to PSB, multi-chain prefetching achieves 27% higher performance for the selected Olden benchmarks, but PSB outperforms multi-chain prefetching by 0.2% for the selected SPECint2000 benchmarks. An ideal PSB with an infinite Markov predictor achieves comparable performance to multi-chain prefetching, coming within 6% across all benchmarks. Finally, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is highly dynamic.
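
The interchain memory parallelism described in the abstract can be illustrated with a toy cost model. The sketch below is not the article's scheduling algorithm. It assumes that every node access misses in the cache, that the memory system can sustain one outstanding miss per chain, and that computation time is negligible. The function names, chain lengths, and latency value are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def serialized_cycles(chain_lengths: Sequence[int], miss_latency: int) -> int:
    """Chains are walked one after another, so every miss is paid in sequence."""
    return sum(chain_lengths) * miss_latency


def overlapped_cycles(chain_lengths: Sequence[int], miss_latency: int) -> int:
    """All chains start together; only misses within one chain are serialized.

    Assumption: enough memory-level parallelism to keep one miss per chain
    in flight, so the longest chain determines the total time.
    """
    return max(chain_lengths, default=0) * miss_latency


# Hypothetical workload: three independent lists of four nodes each,
# with a 100-cycle miss latency.
chains = [4, 4, 4]
latency = 100
print(serialized_cycles(chains, latency))  # 1200
print(overlapped_cycles(chains, latency))  # 400
```

Under these assumptions, overlapping the three chains cuts the stall time from the sum of all chain lengths to the length of the longest chain. A real prefetch schedule would also have to account for finite memory bandwidth and the work performed per node.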

[1]  John Paul Shen,et al.  Dynamic speculative precomputation , 2001, MICRO.

[2]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[4]  Yale N. Patt,et al.  An effective programmable prefetch engine for on-chip caches , 1995, MICRO 1995.

[5]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[6]  Gurindar S. Sohi,et al.  Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[7]  Henry M. Levy,et al.  An Architecture for Software-Controlled Data Prefetching , 1991, ISCA.

[9]  Donald Yeung,et al.  Learning-Based SMT Processor Resource Distribution via Hill-Climbing , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[10]  Donald Yeung,et al.  Design and evaluation of compiler algorithms for pre-execution , 2002, ASPLOS X.

[11]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[12]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[13]  Monica S. Lam,et al.  Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..

[14]  Gurindar S. Sohi,et al.  Effective jump-pointer prefetching for linked data structures , 1999, ISCA.

[15]  Todd C. Mowry,et al.  Tolerating latency in multiprocessors through compiler-inserted prefetching , 1998, TOCS.

[16]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[17]  Donald Yeung,et al.  Optimizing SMT Processors for High Single-Thread Performance , 2003, J. Instr. Level Parallelism.

[18]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.

[19]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[20]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[21]  Daniel A. Connors,et al.  Compiler-directed content-aware prefetching for dynamic data structures , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[22]  David Kroft,et al.  Lockup-free instruction fetch/prefetch cache organization , 1998, ISCA '81.

[23]  Dolores R. Wallace,et al.  Using the Unravel Program Slicing Tool to Evaluate High Integrity Software , 1997.

[24]  Chi-Keung Luk,et al.  Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[25]  John Paul Shen,et al.  Post-pass binary adaptation for software-based speculative precomputation , 2002, PLDI '02.

[26]  Sarita V. Adve,et al.  Code transformations to improve memory parallelism , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[27]  Donald Yeung,et al.  Multi-chain prefetching: effective exploitation of inter-chain memory parallelism for pointer-chasing codes , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[28]  Guang R. Gao,et al.  Speculative Prefetching of Induction Pointers , 2001, CC.

[29]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[31]  Luddy Harrison.  Examination of a memory access classification scheme for pointer-intensive and numeric programs , 1996, ICS '96.

[32]  Janak H. Patel,et al.  Stride directed prefetching in scalar processors , 1992, MICRO.

[33]  Olivier Temam.  Streaming Prefetch , 1996, Euro-Par, Vol. II.

[34]  Christopher Hughes,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.

[35]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[36]  Dionisios N. Pnevmatikatos,et al.  Slice-processors: an implementation of operation-based prediction , 2001, ICS '01.

[37]  Dongkeun Kim,et al.  A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching , 2004.

[38]  Anne Rogers,et al.  Supporting dynamic data structures on distributed-memory machines , 1995, TOPL.

[39]  Tien-Fu Chen,et al.  Alternative implementations of hybrid branch predictors , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[40]  Per Stenström,et al.  A prefetching technique for irregular accesses to linked data structures , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[41]  Brad Calder,et al.  Predictor-directed stream buffers , 2000, MICRO 33.

[42]  Chia-Lin Yang,et al.  Push vs. pull: data movement for linked data structures , 2000, ICS '00.

[43]  Tzi-cker Chiueh,et al.  Sunder: a programmable hardware prefetch architecture for numerical loops , 1994, Proceedings of Supercomputing '94.

[44]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[45]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).