Languages and Compilers for Parallel Computing

OpenMP is an explicit parallel programming model that offers reasonable productivity. Its memory model assumes a shared address space, so the direct translation performed by common OpenMP compilers requires an underlying shared-memory architecture. Many laboratory machines comprise tens of processors built from commodity components and therefore have distributed address spaces. Despite many efforts to provide higher productivity on these platforms, the most common programming model is still message passing, which is substantially more tedious to program than shared-address-space models. This paper presents a compiler/runtime system that translates OpenMP programs into message-passing variants and executes them on clusters of up to 64 processors. We build on previous work that provided a proof of concept of such translation. The present paper describes compiler algorithms and runtime techniques that automatically translate a first class of OpenMP applications: those that exhibit regular write array subscripts and repetitive communication. We evaluate the translator on representative benchmarks of this class and compare the performance of the translated programs against hand-written MPI variants. In all but one case, our translated versions perform close to the hand-written variants.
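To make the idea of translating a shared-memory loop into a message-passing form more concrete, the following is a minimal sketch, not the paper's actual algorithm. It assumes a loop whose write subscript is simply the loop index (a regular write subscript), a block partitioning of iterations, and an all-gather style exchange after the loop. The function names `block_range` and `run_translated_loop` are hypothetical, and the processes are simulated inside one Python program rather than run under MPI.

```python
def block_range(n, nprocs, rank):
    """Return the half-open iteration range [lo, hi) owned by `rank`.

    Assumption: iterations are block-partitioned, with any remainder
    spread over the lowest-numbered ranks.
    """
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def run_translated_loop(b, nprocs):
    """Simulate `for i in range(n): a[i] = 2 * b[i]` on `nprocs` processes.

    Each simulated process keeps a private copy of `a`, mimicking a
    distributed address space. Returns the list of private copies.
    """
    n = len(b)
    local_a = [[0] * n for _ in range(nprocs)]

    # Compute phase: each process executes only its own block of iterations.
    for rank in range(nprocs):
        lo, hi = block_range(n, nprocs, rank)
        for i in range(lo, hi):
            local_a[rank][i] = 2 * b[i]

    # Communication phase: because the write subscript is regular, the
    # written region of each process is exactly its block, so every
    # process can send that block to all others (an all-gather pattern).
    for src in range(nprocs):
        lo, hi = block_range(n, nprocs, src)
        for dst in range(nprocs):
            if dst != src:
                local_a[dst][lo:hi] = local_a[src][lo:hi]

    return local_a
```

After the exchange, every simulated process holds the same array contents that a shared-memory execution of the loop would have produced. When the same loop runs repeatedly, the blocks to exchange do not change, which is the sense in which the communication is repetitive in this sketch.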
