Data Prefetching for Non-Linear Memory References

Previous research on intelligent data prefetching schemes has focused only on array accesses with constant strides. In many applications, such as those written in C/C++ or in object-oriented or functional languages, data are referenced mainly through pointers into non-array structures such as linked lists and trees. In such cases most, if not all, of these schemes lose their effectiveness. To recapture the regularity of the data access pattern and recover the lost system performance, this paper proposes a new data prefetching scheme called Reference Value Prediction Caching (RVPC). Instead of being driven by the memory address sequence, RVPC is driven by information obtained from the symbolic instruction code. Regularity in data accesses is identified and learned by novel hardware proposed as part of the scheme. Because RVPC requires neither architectural nor compiler support, it is attractive for processor implementation. Our simulations show that RVPC can be expected to reduce memory latency significantly (by more than 50% on average), especially for applications that rely on pointers.
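The limitation of constant-stride prefetching described above can be illustrated with a small sketch. This is not the RVPC mechanism, whose hardware details are not given here. It is a hypothetical last-stride predictor applied to two made-up address sequences: a sequential array walk, and a linked-list traversal whose nodes are assumed to sit at scattered heap addresses. The function name, base addresses, and element sizes are all illustrative assumptions.

```python
import random


def stride_hit_rate(addresses):
    """Fraction of accesses a simple last-stride predictor gets right.

    The predictor guesses: next address = last address + last observed stride.
    (Hypothetical model for illustration; not the scheme proposed in the paper.)
    """
    hits = 0
    predictions = 0
    last_addr = None
    last_stride = None
    for addr in addresses:
        # A prediction is possible only once one stride has been observed.
        if last_addr is not None and last_stride is not None:
            predictions += 1
            if last_addr + last_stride == addr:
                hits += 1
        if last_addr is not None:
            last_stride = addr - last_addr
        last_addr = addr
    return hits / predictions if predictions else 0.0


# Array walk: 64 elements of 8 bytes each from an assumed base address.
array_addrs = [0x1000 + 8 * i for i in range(64)]

# Linked-list walk: 64 nodes placed at randomly chosen 64-byte heap slots
# (assumed layout), visited in link order rather than address order.
rng = random.Random(0)
heap_slots = list(range(0x8000, 0x8000 + 4096 * 64, 64))
node_addrs = rng.sample(heap_slots, 64)

array_rate = stride_hit_rate(array_addrs)
list_rate = stride_hit_rate(node_addrs)

print(f"array walk hit rate:       {array_rate:.2f}")
print(f"linked-list walk hit rate: {list_rate:.2f}")
```

Under these assumptions the array walk is fully predictable from its address sequence, while the pointer-chasing walk offers the stride predictor almost nothing to learn from, which is the gap a scheme not driven by the address sequence aims to close.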
