Memory consistency models for shared-memory multiprocessors

The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems.

This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher-level abstraction to the programmer. We show that with only a few types of information supplied by the programmer, an implementation can exploit the full range of optimizations enabled by previous models. Furthermore, the same information enables automatic and efficient portability across a wide range of implementations. To expose the optimizations enabled by a model, we have developed a formal framework for specifying the low-level ordering constraints that must be enforced by an implementation. Based on these specifications, we present a wide range of architecture and compiler implementation techniques for efficiently supporting a given model. Finally, we evaluate the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications.

Our results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads (i.e., the processor stalls on reads), with gains as high as 80%. Architectures with non-blocking reads can further exploit relaxed models to hide a substantial fraction of the read latency as well, leading to a larger overall performance benefit. Furthermore, these optimizations complement gains from other latency-hiding techniques such as prefetching and multiple contexts. We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems.
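
As a rough illustration of how programmer-supplied information can license reordering, the sketch below marks one variable as synchronization and leaves the other as ordinary data. It uses C++11 atomics as a stand-in and is not the notation or the method proposed in the dissertation; the names run_message_passing, payload, ready, and observed are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal message-passing sketch (illustrative assumption, not the
// dissertation's own specification method).
int run_message_passing() {
    int payload = 0;                 // ordinary data: carries no ordering label
    std::atomic<bool> ready{false};  // synchronization flag: labeled by being atomic
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // data write
        ready.store(true, std::memory_order_release);  // release: publishes payload
    });

    std::thread consumer([&] {
        // acquire: pairs with the release store above
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // ordered after the acquire, so it reads 42
    });

    producer.join();
    consumer.join();
    assert(observed == 42);
    return observed;
}
```

Only the accesses to ready carry ordering requirements here. An implementation remains free to overlap or reorder other ordinary data accesses on either side, as long as the release and acquire constraints are respected.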
