Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing

Over the past decade, the trajectory to the petascale has been built on increasing complexity and scale in the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward, easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the 2.3-petaflop Cray XT5 computer at Oak Ridge National Laboratory. We optimized our implementation using the flow intimation technique introduced in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays programming model and a real-world chemistry application—NWChem—from small jobs up through 180,000 cores.
