Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap

In earlier work, we showed that the one-sided communication model found in PGAS languages such as UPC offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. Here we explore the use of the PGAS model on the IBM BlueGene/P, an architecture that combines low-power quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and the GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case study of the communication-limited NAS FT benchmark. We scale the benchmark to 16,384 cores of the BlueGene/P and show that UPC consistently outperforms MPI, by as much as 66% for some processor configurations and by 32% on average. These results also demonstrate the scalability of the PGAS model and of the Berkeley UPC implementation, their viability on machines with multicore nodes, and the effectiveness of the BG/P communication layer in supporting one-sided communication and PGAS languages.
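The decoupling of data transfer from synchronization that the abstract describes can be illustrated with a minimal UPC sketch using non-blocking one-sided puts in the style of the Berkeley UPC extensions. This is an illustrative sketch, not code from the paper: the header name, the `do_local_work` helper, and the buffer sizes are assumptions, and building it requires a UPC compiler such as Berkeley UPC.

```upc
#include <upc.h>
#include <upc_nb.h>   /* assumed header for non-blocking transfer extensions */

#define N 1024
shared [N] double remote_buf[N * THREADS];  /* one block owned by each thread */
double local_buf[N];

extern void do_local_work(void);  /* hypothetical local computation */

void exchange_and_compute(void)
{
    int peer = (MYTHREAD + 1) % THREADS;

    /* Initiate a one-sided, non-blocking put into the peer's block.
       No matching receive call is needed on the remote thread. */
    upc_handle_t h = upc_memput_nb(&remote_buf[peer * N],
                                   local_buf, N * sizeof(double));

    do_local_work();   /* overlap: compute while the data is in flight */

    upc_sync(h);       /* complete the transfer locally */
    upc_barrier;       /* synchronize only where the algorithm requires it */
}
```

In the two-sided MPI model, the transfer would instead complete only when the receiver posts a matching receive, coupling data movement to synchronization between the two processes; the one-sided form above is what enables the communication/computation overlap measured in the paper.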
