Unifying Barrier and Point-to-Point Synchronization in OpenMP with Phasers

OpenMP is a widely used standard for parallel programming on a broad range of SMP systems. In the OpenMP programming model, synchronization points are specified by implicit or explicit barrier operations. However, certain classes of computations, such as stencil algorithms, need to synchronize only among particular tasks or threads. This supports pipeline parallelism, which offers better synchronization efficiency and data locality than wavefront parallelism built on all-to-all barriers. In this paper, we propose two new synchronization constructs for the OpenMP programming model, thread-level phasers and iteration-level phasers, which support various synchronization patterns such as point-to-point synchronization and sub-group barriers with neighbor threads. Experimental results on three platforms using numerical applications show performance improvements of phasers over OpenMP barriers of up to 1.74× on an 8-core Intel Nehalem system, up to 1.59× on a 16-core Core-2-Quad system, and up to 1.44× on a 32-core IBM Power7 system. It is reasonable to expect larger improvements on future manycore processors.
