Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems

Most traditional High End Computing (HEC) applications and current petascale applications are written using the Message Passing Interface (MPI) programming model. Consequently, MPI communication primitives (both point-to-point and collective) are used extensively across scientific and HEC applications. The large-scale HEC systems on which these applications run are, by necessity, designed with multiple layers of switches arranged in different topologies, such as fat-trees (with different kinds of over-subscription), meshes, and tori. Hence, the performance of an MPI library, and in turn of the applications, depends heavily on how the library has been designed and optimized to take the system architecture (processor, memory, network interface, and network topology) into account. In addition, parallel jobs are typically submitted to such systems through schedulers (such as PBS and SLURM). Currently, most schedulers lack the intelligence to allocate compute nodes to MPI tasks based on the underlying topology of the system and the communication requirements of the applications. Thus, the performance and scalability of a parallel application can suffer (even with the best MPI library) if topology-aware scheduling is not employed. Moreover, the placement of logical MPI ranks on a supercomputing system can significantly affect overall application performance: a naive task assignment can result in poor communication locality. It is therefore important to design optimal mapping schemes that use topology information to improve overall application performance and scalability. It is also critical for users of High Performance Computing
