HiCOO: Hierarchical cooperation for scalable communication in Global Address Space programming models on Cray XT systems

Global Address Space (GAS) programming models provide a convenient, shared-memory style addressing model. Typically this is realized through one-sided operations that enable asynchronous communication and data movement. With petascale systems reaching tens of thousands of nodes and hundreds of thousands of cores, the underlying runtime systems face critical challenges in (1) scalably managing resources (such as memory for communication buffers), and (2) gracefully handling unpredictable communication patterns and any associated contention. Any solution that addresses these resource scalability challenges must also maintain the performance of GAS programming models. In this paper, we describe a Hierarchical COOperation (HiCOO) architecture for scalable communication in GAS programming models. HiCOO formulates a cooperative communication architecture with inter-node cooperation among multiple nodes (a.k.a. multinode) and hierarchical cooperation among multinodes that are arranged in various virtual topologies. We have implemented HiCOO for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI). By extensively evaluating different virtual topologies in HiCOO in terms of their impact on memory scalability, network contention, and application performance, we identify MFCG as the most suitable virtual topology. The resulting HiCOO architecture realizes scalable resource management and achieves resilience to network contention, while maintaining or enhancing the performance of scientific applications. In one case, it reduces the total execution time of an NWChem application by 52%.
