Scalable and high-performance MPI design for very large InfiniBand clusters

In the past decade, rapid advances in computer and network design have made it possible to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular programming model for writing applications for such clusters, and a vast array of scientific applications use MPI on them. As applications operate on larger and more complex data, the size of compute clusters continues to scale upward. Thus, in order to deliver the best performance to these scientific applications, it is critical that MPI libraries be designed to be extremely scalable and high-performance. InfiniBand is a cluster interconnect that is based on open standards and is rapidly gaining acceptance. This dissertation presents novel designs, based on new features offered by InfiniBand, for scalable and high-performance MPI libraries on large-scale clusters with tens of thousands of nodes. The methods developed in this dissertation have been applied towards reducing overall resource consumption, increasing the overlap of computation and communication, improving the performance of collective operations, and designing application-level benchmarks that make efficient use of modern networking technology. Software developed as part of this dissertation is available in MVAPICH, a popular open-source implementation of MPI over InfiniBand used by several hundred top computing sites around the world.
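One of the themes mentioned above, overlap of computation and communication, can be illustrated at the application level with standard nonblocking MPI calls. The following is a minimal sketch (not code from the dissertation; the ring-exchange pattern and buffer sizes are illustrative assumptions): each rank posts a nonblocking send and receive, performs independent local work while the library, potentially using RDMA on InfiniBand, progresses the transfer, and then completes the operations.

/*
 * Illustrative sketch: overlapping local computation with a nonblocking
 * ring exchange. The degree of real overlap achieved depends on the MPI
 * library and interconnect (e.g., RDMA-capable InfiniBand HCAs).
 */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++)
        sendbuf[i] = rank + i;

    int right = (rank + 1) % size;          /* ring neighbors */
    int left  = (rank - 1 + size) % size;

    /* Post nonblocking receive and send; the transfer may proceed
     * in the background while the CPU computes below. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent computation overlapped with the pending transfers. */
    for (int i = 0; i < N; i++)
        local += sendbuf[i] * 0.5;

    /* Complete both operations before touching recvbuf. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d: local=%f recv[0]=%f\n", rank, local, recvbuf[0]);
    MPI_Finalize();
    return 0;
}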
