Boosting Multiprocessor Program Performance using Optimized Cache Coherence Protocols

While shared-memory multiprocessors are emerging in the commercial arena as general high-performance computing platforms for a broad range of applications, their complex memory hierarchies are an obstacle to programmers seeking good performance. With the goal of making life easier for the application designer, performance optimizations at the machine level for emerging cache-coherent NUMA machines have been explored at Lund University and at the University of Southern California. This article exposes the strengths and weaknesses of certain machine-level optimizations based on parallel application case studies and architectural simulations. We focus on which application features they optimize and discuss their implications for machine design as well as the demands they impose on application software.
