Building Scalable PGAS Communication Subsystem on Blue Gene/Q

This paper presents the design of a scalable Partitioned Global Address Space (PGAS) communication subsystem for the recently introduced Blue Gene/Q architecture. The proposed design provides an in-depth modeling of the communication infrastructure exposed by the Parallel Active Messaging Interface (PAMI). This infrastructure is used to design time- and space-efficient communication protocols for frequently used datatypes (contiguous and uniformly non-contiguous) with Remote Direct Memory Access (RDMA) get/put primitives. The design accelerates load-balance counters using asynchronous threads, which are required because the network hardware lacks support for generic Atomic Memory Operations (AMOs). Under the proposed design, synchronization traffic is reduced by tracking conflicting memory accesses in distributed memory, at a slight increase in space complexity. An evaluation with simple communication benchmarks shows an adjacent-node get latency of 2.89 µs and a peak bandwidth of 1775 MB/s, corresponding to 99% communication efficiency. On 4096 processes, the proposed asynchronous-thread-based design reduces the execution time of the NWChem self-consistent field calculation by up to 30%.
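
Because Blue Gene/Q's network hardware does not provide generic AMOs, the abstract describes emulating them with asynchronous threads. The sketch below is a minimal, hypothetical illustration of that idea in C with pthreads: a dedicated progress thread on the target node drains a queue of incoming fetch-and-add requests and applies them to local memory, so requesters observe atomic updates to a shared counter without hardware AMO support. The request structure, queue, and function names (amo_request_t, fetch_and_add, progress_thread) are illustrative assumptions, not the paper's runtime or the PAMI API; a real runtime would deliver the requests over the network rather than through a shared in-process queue.

/* Hypothetical sketch: emulating a remote fetch-and-add with an
 * asynchronous progress thread, since hardware AMOs are unavailable.
 * All names are illustrative, not the paper's or PAMI's actual API. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct amo_request {
    long         *target;     /* address in the target's partition   */
    long          increment;  /* value to add                        */
    long          old_value;  /* fetched value returned to requester */
    volatile int  done;       /* completion flag polled by requester */
    struct amo_request *next;
} amo_request_t;

static amo_request_t  *amo_queue = NULL;  /* incoming AMO requests */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int    shutting_down = 0;

/* Asynchronous thread on the target node: applies queued fetch-and-adds. */
static void *progress_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        amo_request_t *req = amo_queue;
        if (req) amo_queue = req->next;
        int stop = shutting_down && !req;
        pthread_mutex_unlock(&queue_lock);
        if (stop) break;
        if (req) {
            req->old_value = *req->target;    /* fetch */
            *req->target  += req->increment;  /* add   */
            __sync_synchronize();             /* publish result before flag */
            req->done = 1;                    /* signal completion */
        }
    }
    return NULL;
}

/* Requester side: enqueue a fetch-and-add and wait for completion.
 * In the real runtime the request would arrive via the network. */
static long fetch_and_add(long *target, long inc) {
    amo_request_t *req = calloc(1, sizeof *req);
    req->target = target;
    req->increment = inc;
    pthread_mutex_lock(&queue_lock);
    req->next = amo_queue;
    amo_queue = req;
    pthread_mutex_unlock(&queue_lock);
    while (!req->done)
        ;                                     /* poll, as a get/put runtime might */
    long old = req->old_value;
    free(req);
    return old;
}

int main(void) {
    long counter = 0;                         /* stands in for a load-balance counter */
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);
    for (int i = 0; i < 4; i++)
        printf("fetched %ld\n", fetch_and_add(&counter, 1));
    shutting_down = 1;
    pthread_join(tid, NULL);
    printf("final counter = %ld\n", counter);
    return 0;
}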
