Hybrid PGAS runtime support for multicore nodes

With multicore processors as the standard building block for high performance systems, parallel runtime systems need to deliver excellent performance on shared memory, distributed memory, and hybrids of the two. Conventional wisdom holds that threads should be the runtime mechanism within shared memory, so runtimes for shared and distributed memory are often designed and implemented separately and retrofitted for hybrid systems after the fact. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer and can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application-level benchmarks on an IBM BG/P, a Cray XT, and an InfiniBand cluster, we show that threads, processes, and combinations of both are needed for maximum performance. Compared to the previous implementation, our new runtime achieves speedups of over 60% on application benchmarks and 100% on collective communication benchmarks. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
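The central comparison is between a pthreads-based runtime, where all PGAS threads on a node share one address space, and a process-based runtime, where separate processes map a common shared-memory segment and communicate through direct loads and stores. As a rough illustration of the latter mechanism only, the sketch below uses POSIX shared memory (shm_open/mmap) across a fork; the segment name and size are hypothetical, and this is an assumed minimal example, not the runtime's actual code.

    /* Minimal sketch of process-based intra-node sharing: two processes
     * map the same POSIX shared-memory segment, so one can read data the
     * other wrote without any message passing. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SEG_NAME "/pgas_demo_seg"   /* hypothetical segment name */
    #define SEG_SIZE 4096

    int main(void) {
        /* Create and size the shared segment before forking. */
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); return 1; }

        char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: store into the segment; the parent sees it directly. */
            strcpy(seg, "hello from child process");
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("parent read: %s\n", seg);  /* direct load, no network path */

        munmap(seg, SEG_SIZE);
        shm_unlink(SEG_NAME);
        return 0;
    }

A full runtime would presumably cross-map one such segment per process and layer the language's shared heap and GASNet communication on top, but the direct load/store path shown here is what makes intra-node communication cheap in a process-based design.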
