Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC

The Unified Parallel C (UPC) programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory subsystems. One convenient feature of UPC is its ability to automatically perform between-thread data movement, so that the entire content of a shared data array appears freely accessible to all threads. This programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when the elements of a shared array are indexed indirectly, because the induced between-thread communication can be both irregular and fine-grained. In this paper, we study performance enhancement strategies that specifically target such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtain considerable performance improvements for UPC programs that originally require fine-grained irregular communication. Beyond these enhancement strategies, the main contribution of this paper is a set of performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of the various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions for similar parallel implementations, not limited to UPC, that involve between-thread or between-process irregular communication. As a further validation, we apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
