Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs

Atomic operations are a key primitive in parallel computing systems. The standard implementation mechanism for atomic operations uses mutual exclusion locks. In an object-based programming system, the natural granularity is to give each object its own lock. Each operation can then make its execution atomic by acquiring and releasing the lock for the object that it accesses. But this fine lock granularity may have high synchronization overhead because it maximizes the number of executed acquire and release constructs. To achieve good performance it may be necessary to reduce the overhead by coarsening the granularity at which the computation locks objects. In this article we describe a static analysis technique, lock coarsening, designed to automatically increase the lock granularity in object-based programs with atomic operations. We have implemented this technique in the context of a parallelizing compiler for irregular, object-based programs and used it to improve the generated parallel code. Experiments with two automatically parallelized applications show the technique to be effective in reducing lock overhead to negligible levels. The results also show, however, that an overly aggressive lock coarsening algorithm may harm overall parallel performance by serializing sections of the parallel computation. A successful compiler must therefore negotiate a trade-off between reduced lock overhead and increased serialization.
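
To make the transformation concrete, the following is a minimal hand-written Java sketch (not the article's compiler output) of the kind of rewrite lock coarsening performs: the per-operation lock of a hypothetical Counter object is hoisted out of a loop so that one acquire/release pair covers many operations. The class and method names are illustrative assumptions, not from the source.

    // Illustrative sketch only: a hand-written analogue of lock coarsening,
    // not the article's actual compiler transformation.
    class Counter {
        private int value;

        // Fine-grained version: one lock acquire/release per operation.
        synchronized void add(int delta) { value += delta; }

        // Unsynchronized body, safe to call only while holding the object's lock.
        void addUnlocked(int delta) { value += delta; }
    }

    public class LockCoarseningSketch {
        // Before coarsening: n acquire/release pairs, one per iteration.
        static void bumpFine(Counter c, int n) {
            for (int i = 0; i < n; i++) c.add(1);
        }

        // After coarsening: the lock is hoisted out of the loop and held once
        // across all n updates. Overhead drops, but other threads are excluded
        // for the whole loop, which is the serialization risk noted above.
        static void bumpCoarsened(Counter c, int n) {
            synchronized (c) {
                for (int i = 0; i < n; i++) c.addUnlocked(1);
            }
        }

        public static void main(String[] args) {
            Counter c = new Counter();
            bumpFine(c, 1000);
            bumpCoarsened(c, 1000);
        }
    }

The coarsened version trades many cheap critical sections for one long one; whether that pays off depends on how much contention the longer critical section introduces, which is exactly the trade-off the abstract describes.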
