Using shared-data localization to reduce the cost of inspector-execution in Unified-Parallel-C programs

- We improve the performance of fine-grained UPC applications by orders of magnitude.
- We introduce a novel shared-data localization transformation.
- We present a thorough performance analysis and evaluation.
- We show that reducing run-time calls is crucial for performance.
- We achieve performance comparable to C and MPI using the UPC programming model.

Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation, however, results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces several techniques that reduce the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data-locality checks, and the use of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code-motion transformation to privatize shared scalars that are propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs up to 1.8× over their hand-optimized UPC counterparts for applications with regular accesses and up to 6.3× for applications with irregular accesses.

[1] Torsten Hoefler, et al. The PERCS High-Performance Interconnect, 2010, 18th IEEE Symposium on High Performance Interconnects.

[2] Joel H. Saltz, et al. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures, 1994, J. Parallel Distributed Comput.

[3] Clifford Stein, et al. Introduction to Algorithms, 2nd edition, 2001.

[4] Shigeru Chiba, et al. A New Optimization Technique for the Inspector-Executor Method, 2002, IASTED PDCS.

[5] Jimmy Su, et al. Automatic support for irregular computations in a high-level language, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[6] Balaram Sinharoy, et al. POWER7: IBM's next generation server processor, 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).

[7] Katherine A. Yelick, et al. Titanium: A High-performance Java Dialect, 1998, Concurr. Pract. Exp.

[8] John M. Mellor-Crummey, et al. Effective communication coalescing for data-parallel applications, 2005, PPOPP.

[9] Zhang Zhang, et al. A UPC runtime system based on MPI and POSIX threads, 2006, 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP'06).

[10] Edith Schonberg, et al. A Unified Framework for Optimizing Communication in Data-Parallel Programs, 1996, IEEE Trans. Parallel Distributed Syst.

[11] Robert W. Numrich, et al. Co-array Fortran for parallel programming, 1998, FORF.

[12] Ramakrishnan Rajamony, et al. PERCS: The IBM POWER7-IH high-performance computing system, 2011, IBM J. Res. Dev.

[13] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.

[14] Mark N. Wegman, et al. Efficiently computing static single assignment form and the control dependence graph, 1991, TOPL.

[15] José Nelson Amaral, et al. A unified parallel C compiler that implements automatic communication aggregation, 2009.

[16] Xunhao Li, et al. Jit4OpenCL: a compiler from Python to OpenCL, 2011.

[17] José Nelson Amaral, et al. Improving communication in PGAS environments: static and dynamic coalescing in UPC, 2013, ICS '13.

[18] Sverre J. Aarseth, et al. Gravitational N-Body Simulations, 2003.

[19] Yunheung Paek, et al. Efficient and precise array access analysis, 2002, TOPL.

[20] Victor Luchangco, et al. The Fortress Language Specification Version 1.0, 2007.

[21] José Nelson Amaral, et al. Compiling Python to a hybrid execution environment, 2010, GPGPU-3.

[22] Rafael Asenjo, et al. Global Data Re-allocation via Communication Aggregation in Chapel, 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.

[23] José Nelson Amaral, et al. Shared memory programming for large scale machines, 2006, PLDI '06.

[24] José Nelson Amaral, et al. Reducing Compiler-Inserted Instrumentation in Unified-Parallel-C Code Generation, 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.

[25] Michail Alvanos, et al. Memory Management Techniques for Exploiting RDMA in PGAS Languages, 2014, LCPC.

[26] David A. Padua, et al. Compiling for a Hybrid Programming Model Using the LMAD Representation, 2001, LCPC.

[27] Michail Alvanos, et al. Performance Analysis of the IBM XL UPC on the PERCS Architecture, 2013.

[28] Sverre J. Aarseth. Gravitational N-Body Simulations: Tools and Algorithms, 2003.

[29] Xavier Martorell, et al. Automatic communication coalescing for irregular computations in UPC language, 2012, CASCON.

[30] Katherine A. Yelick, et al. Communication optimizations for fine-grained UPC applications, 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[31] Peter Brezany, et al. SVM Support in the Vienna Fortran Compilation System, 1994.

[32] Kemal Ebcioğlu, et al. X10: Programming for Hierarchical Parallelism and Non-Uniform Data Access (Extended Abstract), 2004.

[33] Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[34] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.

[35] Daisuke Takahashi, et al. The HPC Challenge (HPCC) benchmark suite, 2006, SC.

[36] Charles Koelbel, et al. Compiling Global Name-Space Parallel Loops for Distributed Execution, 1991, IEEE Trans. Parallel Distributed Syst.

[37] Katherine Yelick, et al. Optimizing partitioned global address space programs for cluster architectures, 2007.

[38] Tarek A. El-Ghazawi, et al. UPC Performance and Potential: A NPB Experimental Study, 2002, ACM/IEEE SC 2002 Conference (SC'02).