Evaluating the Potential of Cray Gemini Interconnect for PGAS Communication Runtime Systems

The Cray Gemini Interconnect has recently been introduced as the next-generation network for building scalable multi-petascale supercomputers. Cray XE6 systems, which use the Gemini Interconnect, are becoming available with the Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models such as Global Arrays, Unified Parallel C, Co-Array Fortran, and the Cascade High Productivity Language (Chapel). These PGAS models use one-sided communication runtime systems such as MPI Remote Memory Access, the Aggregate Remote Memory Copy Interface, and proprietary communication runtimes. The primary objective of our work is to study the potential of the Cray Gemini Interconnect by designing application-specific micro-benchmarks using the DMAPP user-space library. We design micro-benchmarks to study the performance of simple communication primitives, as well as application-specific micro-benchmarks to understand the behavior of the Gemini Interconnect at scale. In our experiments, the Gemini Interconnect achieves a peak bandwidth of 6911 MB/s and a latency of 1 µs for the get communication primitive. Scalability tests of atomic memory operations and the shift communication operation on up to 65,536 processes demonstrate the efficacy of the Gemini Interconnect.
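
The abstract's micro-benchmarks are built directly on Cray's DMAPP user-space library, whose calls and timing parameters are not reproduced here. As a hedged illustration of the measurement pattern only (warm-up iterations, a one-sided get completed per iteration, latency averaged over many trials), the sketch below uses portable MPI Remote Memory Access (MPI_Get over an MPI_Win_allocate window) instead of DMAPP; the payload size, skip count, and iteration count are illustrative choices, not the authors' settings.

/*
 * Hedged sketch of a one-sided "get" latency micro-benchmark.
 * The paper benchmarks DMAPP directly; this portable MPI RMA
 * version only illustrates the typical measurement skeleton
 * (warm-up, per-iteration completion, averaged timing).
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_BYTES 8        /* illustrative: small payload for latency  */
#define SKIP      100      /* warm-up iterations, excluded from timing */
#define ITERS     10000    /* timed iterations                         */

int main(int argc, char **argv)
{
    int rank, size;
    char *win_buf;
    char local_buf[MSG_BYTES];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose a small window on every process. */
    MPI_Win_allocate(MSG_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &win_buf, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0 && size > 1) {
        double t_start = 0.0;
        for (int i = 0; i < SKIP + ITERS; i++) {
            if (i == SKIP)
                t_start = MPI_Wtime();   /* start timing after warm-up */
            /* One-sided get from process 1, completed by a flush. */
            MPI_Get(local_buf, MSG_BYTES, MPI_BYTE, 1, 0,
                    MSG_BYTES, MPI_BYTE, win);
            MPI_Win_flush(1, win);
        }
        double t_end = MPI_Wtime();
        printf("get latency: %.2f us (%d-byte payload)\n",
               1e6 * (t_end - t_start) / ITERS, MSG_BYTES);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with at least two processes, this prints the averaged get latency. A bandwidth variant would issue large gets in the same loop and report bytes transferred per second; the paper's DMAPP benchmarks, along with its atomic-memory-operation and shift tests, follow a similar skeleton but are not shown here.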
