Eliminating synchronization bottlenecks using adaptive replication

This article presents a new technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently update the same object. It is often possible to eliminate synchronization bottlenecks by replicating objects. Each thread can then update its own local replica without synchronization and without interacting with other threads. When the computation needs to access the original object, it combines the replicas to produce the correct values in the original object. One potential problem is that eagerly replicating all objects may lead to performance degradation and excessive memory consumption.

Adaptive replication eliminates unnecessary replication by dynamically detecting contention at each object to find and replicate only those objects that would otherwise cause synchronization bottlenecks. We have implemented adaptive replication in the context of a parallelizing compiler for a subset of C++. Given an unannotated sequential program written in C++, the compiler automatically extracts the concurrency, determines when it is legal to apply adaptive replication, and generates parallel code that uses adaptive replication to efficiently eliminate synchronization bottlenecks.

In addition to automatic parallelization and adaptive replication, our compiler also implements a lock coarsening transformation that increases the granularity at which the computation locks objects. The advantage is a reduction in the frequency with which the computation acquires and releases locks; the potential disadvantage is the introduction of new synchronization bottlenecks caused by increases in the sizes of the critical sections. Because the adaptive replication transformation takes place at lock acquisition sites, there is a synergistic interaction between lock coarsening and adaptive replication.
Lock coarsening drives down the overhead of using adaptive replication, and adaptive replication eliminates synchronization bottlenecks associated with the overaggressive use of lock coarsening. Our experimental results show that, for our set of benchmark programs, the combination of lock coarsening and adaptive replication can eliminate synchronization bottlenecks and significantly reduce the synchronization and replication overhead compared with versions that apply neither transformation or only one of the two.
