Memory consistency models for shared memory multiprocessors

The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or favor programming ease at the cost of performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems. This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher-level abstraction to the programmer. We show that with only a few types of information supplied by the programmer, an implementation can exploit the full range of optimizations enabled by previous models. Furthermore, the same information enables automatic and efficient portability across a wide range of implementations. To expose the optimizations enabled by a model, we have developed a formal framework for specifying the low-level ordering constraints that must be enforced by an implementation. Based on these specifications, we present a wide range of architecture and compiler implementation techniques for efficiently supporting a given model. Finally, we evaluate the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications.
Our results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads (i.e., where the processor stalls on reads), with gains as high as 80%. Architectures with nonblocking reads can further exploit relaxed models to hide a substantial fraction of the read latency as well, leading to a larger overall performance benefit. Furthermore, these optimizations complement gains from other latency-hiding techniques such as prefetching and multiple contexts. We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems.
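
As an illustration of what "fewer ordering constraints" can mean for a program, the following minimal sketch enumerates the possible outcomes of a two-processor litmus test. It is not taken from the dissertation: the test program, the names `P0`, `P1`, `interleavings`, `run`, and `outcomes`, and the simplification that a relaxed model may let each processor's write and read to different locations execute in either order, are all assumptions made for this example.

```python
# Hypothetical litmus test (often called "store buffering"); initially x = y = 0.
#   P0: x = 1; r0 = y
#   P1: y = 1; r1 = x
P0 = [("write", "x", 1), ("read", "y", "r0")]
P1 = [("write", "y", 1), ("read", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each list."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global order of operations on a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "write":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return (regs["r0"], regs["r1"])


def outcomes(p0_orders, p1_orders):
    """Collect (r0, r1) over all interleavings of the allowed per-processor orders."""
    results = set()
    for a in p0_orders:
        for b in p1_orders:
            for schedule in interleavings(a, b):
                results.add(run(schedule))
    return results


# Sequential consistency: each processor's operations stay in program order.
sc = outcomes([P0], [P1])

# Assumed relaxation: the write and the following read (to a different
# location) may also be performed in the opposite order.
relaxed = outcomes([P0, P0[::-1]], [P1, P1[::-1]])

print("SC outcomes:     ", sorted(sc))
print("Relaxed outcomes:", sorted(relaxed))
```

Under these assumptions, the outcome `r0 = 0, r1 = 0` never appears when program order is preserved but becomes possible once the write and read may be reordered, which is the kind of behavior a programmer must reason about under a relaxed model.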

[1]  Anoop Gupta,et al.  Sufficient System Requirements for Supporting the PLpc Memory Model , 1993 .

[2]  Jong-Deok Choi,et al.  An efficient cache-based access anomaly detection scheme , 1991, ASPLOS IV.

[3]  David B. Gustavson,et al.  Scalable Coherent Interface , 1990, COMPEURO'90: Proceedings of the 1990 IEEE International Conference on Computer Systems and Software Engineering@m_Systems Engineering Aspects of Complex Computerized Systems.

[4]  Yehuda Afek,et al.  A lazy cache algorithm , 1989, SPAA '89.

[5]  Barton P. Miller,et al.  Improving the accuracy of data race detection , 1991, PPOPP '91.

[6]  T. Mowry,et al.  Comparative evaluation of latency reducing and tolerating techniques , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.

[7]  Kourosh Gharachorloo,et al.  Detecting violations of sequential consistency , 1991, SPAA '91.

[8]  David L. Dill,et al.  An executable specification, analyzer and verifier for RMO (relaxed memory order) , 1995, SPAA '95.

[9]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[10]  A. Gupta,et al.  Parallel distributed-time logic simulation , 1989, IEEE Design & Test of Computers.

[11]  William M. Johnson,et al.  Super-scalar processor design , 1989 .

[12]  Michael D. Smith,et al.  Boosting beyond static scheduling in a superscalar processor , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[13]  Anant Agarwal,et al.  Closing the window of vulnerability in multiphase memory transactions , 1992, ASPLOS V.

[14]  Daniel E. Lenoski,et al.  The design and analysis of DASH: a scalable directory-based multiprocessor , 1992 .

[15]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[16]  Mike Johnson,et al.  Superscalar microprocessor design , 1991, Prentice Hall series in innovative technology.

[17]  Michel Dubois,et al.  Memory access buffering in multiprocessors , 1998, ISCA '98.

[18]  Yale N. Patt,et al.  Exploiting Fine-Grained Parallelism Through a Combination of Hardware and Software Techniques , 1991, ISCA.

[19]  J.P. Singh Implications of Hierarchical N-body Methods for Multiprocessor Architecture , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[20]  Samuel P. Midkiff,et al.  Compiling programs with user parallelism , 1990 .

[21]  J. Mcdonald,et al.  Vectorization of a particle simulation method for hypersonic rarefied flow , 1988 .

[22]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[23]  Katherine A. Yelick,et al.  Optimizing parallel programs with explicit synchronization , 1995, PLDI '95.

[24]  Roy Friedman,et al.  Shared memory consistency conditions for non-sequential execution: definitions and programming strategies , 1993, SPAA '93.

[25]  James R. Goodman,et al.  Cache Consistency and Sequential Consistency , 1991 .

[26]  Francisco Corella,et al.  Specification of the powerpc shared memory architecture , 1993 .

[27]  Cathy May,et al.  The PowerPC Architecture: A Specification for a New Family of RISC Processors , 1994 .

[28]  Michael Stumm,et al.  Cache consistency in hierarchical-ring-based multiprocessors , 1992, Proceedings Supercomputing '92.

[29]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[30]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[31]  Anoop Gupta,et al.  Specifying system requirements for memory consistency models , 1993 .

[32]  Josep Torrellas,et al.  False Sharing ans Spatial Locality in Multiprocessor Caches , 1994, IEEE Trans. Computers.

[33]  Richard P. LaRowe,et al.  Hiding Shared Memory Reference Latency on the Galactica Net Distributed Shared Memory Architecture , 1992, J. Parallel Distributed Comput..

[34]  Sarita V. Adve,et al.  Designing memory consistency models for shared-memory multiprocessors , 1993 .

[35]  David Padua,et al.  Debugging Fortran on a shared memory machine , 1987 .

[36]  Roy Friedman,et al.  A Correctness Condition for High-Performance Multiprocessors , 1998, SIAM J. Comput..

[37]  David W. Wall,et al.  Link-Time Code Modification , 1989 .

[38]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1986, PODC '86.

[39]  Michel Dubois,et al.  Access ordering and coherence in shared memory multiprocessors , 1989 .

[40]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[41]  Alan L. Cox,et al.  Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[42]  Jonathan Rose LocusRoute: a parallel global router for standard cells , 1988, 25th ACM/IEEE, Design Automation Conference.Proceedings 1988..

[43]  Mark D. Hill,et al.  A Unified Formalization of Four Shared-Memory Models , 1993, IEEE Trans. Parallel Distributed Syst..

[44]  Anant Agarwal,et al.  APRIL: a processor architecture for multiprocessing , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[45]  Michel Dubois,et al.  Correct memory operation of cache-based multiprocessors , 1987, ISCA '87.

[46]  Michael E. Wolf,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[47]  Larry Rudolph,et al.  Dynamic decentralized cache schemes for mimd parallel processors , 1984, ISCA '84.

[48]  Edith Schonberg,et al.  An empirical comparison of monitoring algorithms for access anomaly detection , 2011, PPOPP '90.

[49]  Anoop Gupta,et al.  Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.

[50]  Richard N. Taylor,et al.  A general-purpose algorithm for analyzing concurrent programs , 1983, CACM.

[51]  Anoop Gupta,et al.  Hiding memory latency using dynamic scheduling in shared-memory multiprocessors , 1992, ISCA '92.

[52]  Brian N. Bershad,et al.  Midway : shared memory parallel programming with entry consistency for distributed memory multiprocessors , 1991 .

[53]  Mark D. Hill,et al.  Weak ordering—a new definition , 1998, ISCA '98.

[54]  Mats Brorsson,et al.  An adaptive cache coherence protocol optimized for migratory sharing , 1993, ISCA '93.

[55]  Ken Kennedy,et al.  Parallel program debugging with on-the-fly anomaly detection , 1990, Proceedings SUPERCOMPUTING '90.

[56]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[57]  Christos H. Papadimitriou,et al.  The Theory of Database Concurrency Control , 1986 .

[58]  Bob Beck,et al.  Shared-memory parallel programming in C++ , 1990, IEEE Software.

[59]  Kourosh Gharachorloo,et al.  Proving sequential consistency of high-performance shared memories (extended abstract) , 1991, SPAA '91.

[60]  Alan L. Cox,et al.  Evaluation of release consistent software distributed shared memory on emerging network technology , 1993, ISCA '93.

[61]  Michel Dubois,et al.  Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[62]  Andrew R. Pleszkun,et al.  Implementation of precise interrupts in pipelined processors , 1985, ISCA '98.

[63]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[64]  Niklaus Wirth,et al.  Modula: A language for modular multiprogramming , 1977, Softw. Pract. Exp..

[65]  Michel Dubois,et al.  Memory Access Dependencies in Shared-Memory Multiprocessors , 1990, IEEE Trans. Software Eng..

[66]  Per Brinch Hansen,et al.  The Architecture of Concurrent Programs , 1977 .

[67]  Robert H. B. Netzer,et al.  Detecting data races on weak memory systems , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.

[68]  David B. Loveman High performance Fortran , 1993, IEEE Parallel & Distributed Technology: Systems & Applications.

[69]  Barton P. Miller,et al.  On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions , 1990, ICPP.

[70]  R.K. Brayton,et al.  Automatic verification of memory systems which service their requests out of order , 1995, Proceedings of ASP-DAC'95/CHDL'95/VLSI'95 with EDA Technofair.

[71]  William W. Collier,et al.  Reasoning about parallel architectures , 1992 .

[72]  Michel Dubois,et al.  Lockup-free Caches in High-Performance Multiprocessors , 1990, J. Parallel Distributed Comput..

[73]  Ken Kennedy,et al.  Compile-time detection of race conditions in a parallel program , 1989, ICS '89.

[74]  John L. Hennessy,et al.  Finding and Exploiting Parallelism in an Ocean Simulation Program: Experience, Results, and Implications , 1992, J. Parallel Distributed Comput..

[75]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[76]  Erik Hagersten,et al.  Race-Free Interconnection Networks and Multiprocessor Consistency , 1991, ISCA.

[77]  Brian N. Bershad,et al.  PRESTO: A system for object‐oriented parallel programming , 1988, Softw. Pract. Exp..

[78]  Brian N. Bershad,et al.  The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[79]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1984, TOPL.

[80]  Anoop Gupta,et al.  Programming for Different Memory Consistency Models , 1992, J. Parallel Distributed Comput..

[81]  Michel Cekleov,et al.  Formal Specification of Memory Models , 1992 .

[82]  David W. Wall,et al.  Global register allocation at link time , 1986, SIGPLAN '86.

[83]  Katherine A. Yelick,et al.  Optimizing Parallel SPMD Programs , 1994, LCPC.

[84]  Kunle Olukotun,et al.  Performance Optimization of Pipelined Primary Caches , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[85]  Kevin P. McAuliffe,et al.  RP3 Processor-Memory Element , 1985, ICPP.

[86]  Susan J. Eggers,et al.  On the validity of trace-driven simulation for multiprocessors , 1991, ISCA '91.

[87]  Richard Noah Zucker,et al.  Relaxed consistency and synchronization in parallel processors , 1992 .

[88]  Anoop Gupta,et al.  Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991, ASPLOS IV.

[89]  Jr. Richard Thomas Simoni,et al.  Cache coherence directories for scalable multiprocessors , 1992 .

[90]  Monica S. Lam,et al.  Jade: a high-level, machine-independent language for parallel programming , 1993, Computer.

[91]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[92]  Robert J. Fowler,et al.  Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.

[93]  Michel Dubois,et al.  Concurrent Miss Resolution in Multiprocessor Caches , 1988, ICPP.

[94]  A. Gupta,et al.  The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[95]  Helen Davis,et al.  Tango introduction and tutorial , 1990 .

[96]  Barton P. Miller,et al.  Detecting Data Races in Parallel Program Executions , 1989 .

[97]  Maurice Herlihy,et al.  Linearizability: a correctness condition for concurrent objects , 1990, TOPL.

[98]  N. Jouppi,et al.  Complexity/performance tradeoffs with non-blocking loads , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[99]  Dennis Shasha,et al.  Efficient and correct execution of parallel programs that share memory , 1988, TOPL.

[100]  Mark D. Hill,et al.  Implementing Sequential Consistency in Cache-Based Systems , 1990, ICPP.

[101]  Anant Agarwal,et al.  LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[102]  Butler W. Lampson,et al.  Experience with processes and monitors in Mesa , 1980, CACM.

[103]  Josep Torrellas,et al.  Estimating the Performance Advantages of Relaxing Consistency in a Shared Memory Multiprocessor , 1990, ICPP.

[104]  Werner Buchholz,et al.  Planning a Computer System: Project Stretch , 1962 .

[105]  Richard L. Sites,et al.  Alpha AXP architecture reference manual , 1995 .

[106]  Alan Jay Smith,et al.  Branch Prediction Strategies and Branch Target Buffer Design , 1995, Computer.

[107]  Todd C. Mowry,et al.  Tolerating latency through software-controlled data prefetching , 1994 .

[108]  Anoop Gupta,et al.  Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations , 1990, Proceedings SUPERCOMPUTING '90.

[109]  Arthur J. Bernstein,et al.  Analysis of Programs for Parallel Processing , 1966, IEEE Trans. Electron. Comput..

[110]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[111]  Gregory R. Andrews,et al.  Concurrent programming - principles and practice , 1991 .

[112]  David Callahan,et al.  A future-based parallel language for a general-purpose highly-parallel computer , 1990 .

[113]  James P. Laudon,et al.  Architectural and Implementation Tradeoffs for Multiple-Context Processors , 1995 .

[114]  Jean-Loup Baer,et al.  A Performance Study of Memory Consistency Models , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[115]  Mark D. Hill,et al.  Sufficient Conditions for Implementing theData-Race-Free-1 Memory Model, * , 1992 .