Scalable Job Startup and Inter-Node Communication in Multi-Core InfiniBand Clusters

Moore's Law scaling, that is, frequency scaling and the exploitation of Instruction Level Parallelism enabled by increasing transistor density, no longer yields performance gains in modern systems because of limits on power dissipation. This has led to an increased focus on deriving performance gains from Data Level Parallelism through parallel computing. Clusters, groups of commodity compute nodes connected via a modern interconnect, have emerged as the top supercomputers in the world, and the Message Passing Interface (MPI) has become the de facto standard parallel programming model on large clusters. Scientific and financial applications have ever-increasing demands for compute cycles, and the emergence of multi-core processors has driven enormous growth in cluster sizes in recent years. InfiniBand has emerged as a popular low-latency, high-bandwidth interconnect for these large clusters. As cluster sizes continue to scale, the scalability of MPI libraries and of associated system software such as the job launcher has become a central concern in the High Performance Computing (HPC) community.

In this work we examine current job launching mechanisms, which suffer from scalability problems on large-scale clusters due to resource constraints as well as performance bottlenecks. We propose a Scalable and Extensible Launching Architecture for Clusters (ScELA) that scales to modern clusters such as the 64K-processor TACC Ranger. We also examine the scalability constraints of point-to-point InfiniBand channels in MPI libraries, and use the eXtended Reliable Connection (XRC) transport available in recent InfiniBand adapters to design a scalable MPI communication channel with a smaller memory footprint. The designs proposed in this work are available in both the MVAPICH and MVAPICH2 MPI libraries over InfiniBand, which are used by more than nine hundred organizations worldwide.
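To give a rough sense of why moving from the fully connected Reliable Connection (RC) transport to XRC shrinks the memory footprint, the sketch below compares per-process queue pair (QP) counts under an all-to-all RC model and a simplified node-granular XRC model. The node and core counts, the per-QP memory figure, and the "one connection per peer node" approximation are illustrative assumptions, not figures from MVAPICH or MVAPICH2.

```c
/*
 * Back-of-the-envelope comparison of per-process queue pair (QP) counts
 * for a fully connected RC transport versus a node-granular XRC-style model.
 *
 * The per-QP memory figure and the simple "one QP per peer node" XRC model
 * are illustrative assumptions, not measurements from MVAPICH/MVAPICH2.
 */
#include <stdio.h>

int main(void)
{
    const long nodes          = 4096;   /* roughly Ranger-scale (assumed)    */
    const long cores_per_node = 16;     /* multi-core nodes (assumed)        */
    const long procs          = nodes * cores_per_node;

    const double kb_per_qp = 12.0;      /* assumed per-connection footprint  */

    /* RC: every process keeps a QP to every other process. */
    long rc_qps_per_proc  = procs - 1;

    /* XRC (simplified): one QP per remote node suffices, because the
       receive side is shared by all processes on that node. */
    long xrc_qps_per_proc = nodes - 1;

    printf("processes           : %ld\n", procs);
    printf("RC  QPs per process : %ld (~%.1f MB)\n",
           rc_qps_per_proc,  rc_qps_per_proc  * kb_per_qp / 1024.0);
    printf("XRC QPs per process : %ld (~%.1f MB)\n",
           xrc_qps_per_proc, xrc_qps_per_proc * kb_per_qp / 1024.0);
    return 0;
}
```

Under this simplified model the reduction in connection memory grows with the number of cores per node, which is precisely the trend that makes per-connection state a scalability concern on large multi-core InfiniBand clusters.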
