Making Sequential Consistency Practical in Titanium

The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. The most natural model for programmers is to have memory accesses appear to take effect in the order specified in the original program. Language designers have been reluctant to use this strong semantics, called sequential consistency, due to concerns over the performance of memory fence instructions and related mechanisms that guarantee ordering. In this paper, we provide evidence for the practicality of sequential consistency by showing that advanced compiler analysis techniques are sufficient to eliminate the need for most memory fences and to enable high-level optimizations. Our analyses eliminated over 97% of the memory fences needed by a naïve implementation, accounting for 87% to 100% of the dynamically encountered fences in all but one benchmark. The impact of the memory model and analysis on runtime performance depends on the quality of the optimizations: more aggressive optimizations are more likely to be invalidated by a strong memory consistency semantics. We consider two specific optimizations, pipelining of bulk memory copies and communication aggregation and scheduling for irregular accesses, and show that our most aggressive analysis is able to obtain the same performance as the relaxed model when applied to two linear algebra kernels. While additional work on parallel optimizations and analyses is needed, we believe these results provide important evidence for the viability of using a simple memory consistency model without sacrificing performance.
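Since Titanium is a Java dialect, the fence-placement problem the analyses address can be illustrated with a small, hypothetical Java sketch (not taken from the paper): in the producer/consumer pattern below, a naïve sequentially consistent implementation must fence every shared access to preserve program order, while the kind of concurrency and sharing analysis described in the abstract aims to prove most such fences unnecessary.

    // Illustrative Java sketch (plain Java threads, not Titanium itself).
    // The fields are deliberately left as ordinary (non-volatile) fields to mark
    // exactly where a naive sequentially consistent compiler would insert fences;
    // as ordinary Java, the program would need volatile fields for the same guarantee.
    public class FenceExample {
        static int data = 0;          // payload written by the producer
        static boolean ready = false; // flag publishing the payload

        public static void main(String[] args) throws InterruptedException {
            Thread producer = new Thread(() -> {
                data = 42;            // (1) write payload
                // naive SC: fence here so (1) is visible before (2)
                ready = true;         // (2) publish
            });
            Thread consumer = new Thread(() -> {
                while (!ready) { /* spin until published */ }
                // naive SC: fence here so the read of data is not hoisted above the flag check
                System.out.println(data); // sequential consistency guarantees 42 is printed
            });
            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }

If analysis shows that data and ready can never participate in such a cross-thread ordering (for example, because they are thread-private or the accesses are separated by barriers), both fences can be removed and optimizations such as reordering or aggregating the accesses remain legal.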
