Efficient integration of compiler-directed cache coherence and data prefetching

Enforcing cache coherence and reducing or hiding memory latency are important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated framework that addresses both problems through a compiler-directed cache coherence scheme called Cache Coherence with Data Prefetching (CCDP). The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the nonstale references to hide their memory latencies. To optimize performance, the scheme relies on some prefetch hardware support that handles these two forms of data prefetching efficiently. We also developed the compiler techniques that the CCDP scheme uses for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the CCDP scheme via execution-driven simulations of several applications from the SPEC CFP95 and Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied.
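The two uses of prefetching described above can be illustrated with a toy model. The following Python sketch is not the CCDP implementation. It assumes a write-through private cache with no hardware coherence, and it assumes that a coherence-enforcing prefetch discards the local copy before refetching from memory. The names `Cache`, `prefetch`, `invalidate`, and `run_demo` are hypothetical and chosen only for this illustration.

```python
class Cache:
    """Toy private cache without hardware coherence (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for main memory
        self.lines = {}       # address -> locally cached value

    def read(self, addr):
        # A plain read returns the cached copy if one exists, even a stale one.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Assumption: write-through, so main memory always holds the latest value.
        self.lines[addr] = value
        self.memory[addr] = value

    def prefetch(self, addr, invalidate=False):
        # invalidate=True models a prefetch of a potentially stale reference:
        # drop the local copy so the refetch brings in up-to-date data.
        # invalidate=False models a latency-hiding prefetch of a nonstale
        # reference: fetch only if the line is not already cached.
        if invalidate:
            self.lines.pop(addr, None)
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]


def run_demo():
    memory = {0: 1}
    p0, p1 = Cache(memory), Cache(memory)
    p1.read(0)                        # P1 caches the old value
    p0.write(0, 2)                    # P0 updates memory; P1's copy is stale
    stale = p1.read(0)                # plain read sees the stale copy
    p1.prefetch(0, invalidate=True)   # compiler-inserted coherence prefetch
    fresh = p1.read(0)                # read now sees the updated value
    return stale, fresh
```

In this sketch the compiler's role is reduced to the choice of the `invalidate` flag at each prefetch site. In the scheme described above, that choice would come from stale reference detection, while prefetch target analysis and prefetch scheduling would decide which references to prefetch and where to place the prefetches.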
