Enabling a highly-scalable global address space model for petascale computing

Over the past decade, the path to the petascale has been built on the increasing complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward, easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the Cray XT5 2.3 PetaFLOPs computer at Oak Ridge National Laboratory. We optimize our implementation with flow intimation, a technique that we introduce in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays (GA) programming model and a real-world chemistry application, NWChem, from small jobs up through 180,000 cores.
