Towards general and exact distributed invalidation

This paper develops and proves an exact distributed invalidation algorithm for programs with general array accesses, arbitrary parallelisation and migratory writes. We present an efficient constructive algorithm that globally combines locally gathered information to insert coherence calls so as to eliminate invalidation traffic without loss of locality, while placing the minimal number of coherence calls. Experimental results across a range of benchmarks show that it outperforms hardware-based sequential and release consistency approaches and decreases application execution time by up to 12%. This improvement comes from eliminating over 99% of the invalidation traffic in all benchmarks, which in turn reduces the total amount of network traffic by up to 28% and the number of network words transmitted by up to 19%.
