An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, each has its shortcomings. Software schemes require less hardware support than hardware schemes, but they must issue address-calculation instructions and a prefetch instruction for each datum to be prefetched. Hardware schemes, in turn, must become progressively more complex in order to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware to handle most data accesses and software prefetching for the few that remain. A compile-time algorithm analyzes the access streams formed by array references and identifies the sequences of consecutive memory accesses within each stream that the hardware mechanism can prefetch. This analysis is based on the relative memory locations of consecutive accesses in an access stream and on the number of intervening data references between those accesses. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware cost. Execution-driven simulations show our method to be very effective.
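
The compile-time decision described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract names only the two inputs (the relative memory locations of consecutive accesses and the number of intervening data references), so the `AccessStream` type, the threshold values, and the lookahead formula below are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccessStream:
    """One array-reference stream in a loop (hypothetical representation)."""
    name: str
    stride_bytes: int      # distance in memory between consecutive accesses
    intervening_refs: int  # other data references between consecutive accesses


def choose_prefetcher(stream: AccessStream,
                      max_stride_bytes: int = 32,
                      max_intervening: int = 8) -> str:
    """Assign a stream to the simple hardware prefetcher or to software.

    Assumption: the hardware can follow a stream only when consecutive
    accesses are close together in memory and not separated by too many
    other references. Both thresholds are illustrative, not from the paper.
    """
    near = 0 < abs(stream.stride_bytes) <= max_stride_bytes
    dense = stream.intervening_refs <= max_intervening
    return "hardware" if near and dense else "software"


def lookahead_iterations(latency_cycles: int, cycles_per_iteration: int) -> int:
    """Per-stream lookahead as ceil(latency / iteration time).

    Assumption: this is a common rule of thumb for prefetch distance,
    used here only to show that lookahead can differ per stream.
    """
    return -(-latency_cycles // cycles_per_iteration)


if __name__ == "__main__":
    streams = [
        AccessStream("A[i]", stride_bytes=8, intervening_refs=2),
        AccessStream("B[i][j] column walk", stride_bytes=4096, intervening_refs=2),
        AccessStream("C[i] in a large loop body", stride_bytes=8, intervening_refs=20),
    ]
    for s in streams:
        print(s.name, "->", choose_prefetcher(s))
    print("lookahead:", lookahead_iterations(100, 30), "iterations")
```

Under these assumed thresholds, only the unit-stride stream with few intervening references is left to the hardware; the large-stride and sparsely accessed streams fall back to explicit software prefetches.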

Alexander V. Veidenbaum et al. An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors. 1994 International Conference on Parallel Processing, Vol. 2, 1994.
