Hardware and Compiler-Directed Cache Coherence in Large-Scale Multiprocessors: Design Considerations and Performance Study

In this paper, we study a hardware-supported, compiler-directed (HSCD) cache coherence scheme that can be implemented on a large-scale multiprocessor built from off-the-shelf microprocessors, such as the Cray T3D. The scheme can be adapted to various cache organizations, including multiword cache lines and byte-addressable architectures. Several system-related issues, including critical sections, interthread communication, and task migration, are also addressed. The cost of the required hardware support is minimal and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented in the Polaris parallelizing compiler. In a simulation study using the Perfect Club benchmarks, we found that, despite the conservative analysis performed by the compiler, the proposed HSCD scheme outperforms the full-map hardware directory scheme by up to 70 percent on four of the six benchmark programs tested, while the hardware scheme outperforms the HSCD scheme by up to 89 percent on the remaining two. Given its comparable performance and reduced hardware cost, the proposed scheme can be a viable alternative for large-scale multiprocessors such as the Cray T3D, which rely on users to maintain data coherence.
