A Primer on Memory Consistency and Cache Coherence

Many modern computer systems and most multicore chips (chip multiprocessors) support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date. The goal of this primer is to provide readers with a basic understanding of consistency and coherence. This understanding includes both the issues that must be solved and a variety of solutions. We present both high-level concepts and specific, concrete examples from real-world systems.

Table of Contents: Preface / Introduction to Consistency and Coherence / Coherence Basics / Memory Consistency Motivation and Sequential Consistency / Total Store Order and the x86 Memory Model / Relaxed Memory Consistency / Coherence Protocols / Snooping Coherence Protocols / Directory Coherence Protocols / Advanced Topics in Coherence / Author Biographies
