An Integrated Tutorial on InfiniBand, Verbs, and MPI

This tutorial presents the details of the interconnection network used in many high performance computing (HPC) systems today. “InfiniBand” is the hardware interconnect used by over 35% of the top 500 supercomputers in the world as of June 2017. “Verbs” is the term used both for the semantic description of the interface in the InfiniBand architecture specifications and for the functions defined in the widely used OpenFabrics Alliance implementation of the software interface to InfiniBand. The “Message Passing Interface” (MPI) is the primary software library by which HPC applications portably pass messages between processes across a wide range of interconnects, including InfiniBand. Our goal is to explain how these three components are designed and how they interact to provide a powerful, efficient interconnect for HPC applications. We provide a succinct look into the inner technical workings of each component that should be instructive both to novices to HPC applications and to those who may be familiar with one component, but not necessarily the others, in the design and functioning of the total interconnect. A supercomputer interconnect is not a monolithic structure, and this tutorial aims to give non-experts a “big-picture” overview of its substructure, with an appreciation of how and why features in one component influence those in others. We believe this is one of the first tutorials to discuss these three major components as one integrated whole. In addition, we give detailed examples of practical experience and of typical algorithms used within each component, to give insight into which issues and trade-offs are important.


[45]  Stephen E. Deering,et al.  IP Version 6 Addressing Architecture , 1995, RFC.

[46]  Dhabaleswar K. Panda,et al.  Scalable MPI design over InfiniBand using eXtended Reliable Connection , 2008, 2008 IEEE International Conference on Cluster Computing.

[47]  Torsten Hoefler,et al.  Optimized Routing for Large-Scale InfiniBand Networks , 2009, 2009 17th IEEE Symposium on High Performance Interconnects.

[48]  Lionel M. Ni,et al.  Adaptive routing in irregular networks using cut-through switches , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[49]  Dror Goldenberg,et al.  Zero copy sockets direct protocol over infiniband-preliminary implementation and performance analysis , 2005, 13th Symposium on High Performance Interconnects (HOTI'05).

[50]  Vern Paxson,et al.  TCP Congestion Control , 1999, RFC.

[51]  Torsten Hoefler,et al.  Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack , 2006, Euro-Par.

[52]  W. W. PETERSONt,et al.  Cyclic Codes for Error Detection * , 2022 .

[53]  Qian Liu,et al.  A Dynamic Congestion Management System for InfiniBand Networks , 2016, Supercomput. Front. Innov..

[54]  Dhabaleswar K. Panda,et al.  Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device , 2005, 2005 IEEE International Conference on Cluster Computing.

[55]  Robert Klöfkorn,et al.  Asynchronous communication in spectral-element and discontinuous Galerkin methods for atmospheric dynamics– a case study using the High-Order Methods Modeling Environment (HOMME-homme_dg_branch) , 2016 .

[56]  José Duato,et al.  On the Performance of Up*/Down* Routing , 2000, CANPC.

[57]  Steven J. Martin Cray XC30 Power Monitoring and Management , 2014 .

[58]  Sabine Richling,et al.  A long-distance infiniband interconnection between two clusters in production use , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[59]  J. J. Garcia-Luna-Aceves,et al.  A Minimum-Hop Routing Algorithm Based on Distributed Information , 1989, Comput. Networks.

[60]  Michael Burrows,et al.  Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links , 1991, IEEE J. Sel. Areas Commun..

[61]  Ludmila Cherkasova,et al.  Fibre channel fabrics: evaluation and design , 1996, Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences.

[62]  Sayantan Sur,et al.  High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters , 2007, ICS '07.

[63]  W. Collins,et al.  The Community Earth System Model: A Framework for Collaborative Research , 2013 .

[64]  Robert B. Ross,et al.  Using MPI-2: Advanced Features of the Message Passing Interface , 2003, CLUSTER.

[65]  Dhabaleswar K. Panda,et al.  High performance RDMA-based MPI implementation over InfiniBand , 2003, ICS.

[66]  Wei Huang,et al.  Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand , 2008, 2008 16th IEEE Symposium on High Performance Interconnects.

[67]  Frank O. Bryan,et al.  Yellowstone: A Dedicated Resource for Earth System Science , 2017 .

[68]  Jose Renato Santos,et al.  End-to-end congestion control for infiniband , 2003, IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428).

[69]  José Duato,et al.  On the Infiniband subnet discovery process , 2003, 2003 Proceedings IEEE International Conference on Cluster Computing.

[70]  Amith R. Mamidala,et al.  Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand , 2009, 2009 International Conference on Parallel Processing.

[71]  Dhabaleswar K. Panda,et al.  Designing a high-performance clustered NAS: a case study with pNFS over RDMA on InfiniBand , 2008, HiPC'08.

[72]  William J. Dally,et al.  Deadlock-Free Message Routing in Multiprocessor Interconnection Networks , 1987, IEEE Transactions on Computers.

[73]  George Bosilca,et al.  Open MPI: A High-Performance, Heterogeneous MPI , 2006, 2006 IEEE International Conference on Cluster Computing.

[74]  Yijie Han,et al.  An Optimal Scheme for Disseminating Information , 1988, ICPP.

[75]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[76]  Xin Yuan,et al.  LID Assignment in InfiniBand Networks , 2009, IEEE Transactions on Parallel and Distributed Systems.

[77]  Joonhyouk Jang,et al.  Integrated financial trading system based on distributed in-memory database , 2014, RACS '14.

[78]  Dhabaleswar K. Panda,et al.  Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems , 2012, 2012 IEEE 20th Annual Symposium on High-Performance Interconnects.

[79]  Yitzhak Birk,et al.  Improving communication-phase completion times in HPC clusters through congestion mitigation , 2009, SYSTOR '09.

[80]  Katherine E. Isaacs,et al.  There goes the neighborhood: Performance degradation due to nearby jobs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[81]  Michael Kenneth Lang Software Defined Networking for HPC Interconnect and its Extension across Domains , 2016 .

[82]  Eugene D. Brooks,et al.  The butterfly barrier , 1986, International Journal of Parallel Programming.

[83]  Thomas Narten,et al.  IPv6 Stateless Address Autoconfiguration , 1996, RFC.

[84]  Odysseas I. Pentakalos An Introduction to the InfiniBand Architecture , 2002, Int. CMG Conference.

[85]  Dan Meng,et al.  Early Experiences with Write-Write Design of NFS over RDMA , 2009, 2009 IEEE International Conference on Networking, Architecture, and Storage.

[86]  Yeh-Ching Chung,et al.  A multiple LID routing scheme for fat-tree-based InfiniBand networks , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[87]  Torsten Hoefler,et al.  Deadlock-Free Oblivious Routing for Arbitrary Topologies , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[88]  Dhabaleswar K. Panda,et al.  Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[89]  Olav Lysne,et al.  dFtree: a fat-tree routing algorithm using dynamic allocation of virtual lanes to alleviate congestion in infiniband networks , 2011, NDM '11.

[90]  A. Benner,et al.  Optical interconnect opportunities in supercomputers and high end computing , 2012, OFC/NFOEC.

[91]  Robert D. Russell,et al.  A Performance Study to Guide RDMA Programming Decisions , 2012, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems.