An In-Depth Analysis of the Slingshot Interconnect

The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance grows with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow exascale and hyperscale datacenter networks to be built with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, as well as highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which makes it interoperable with standard Ethernet devices while still delivering high performance to HPC applications. We analyze the extent to which Slingshot delivers on these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks.
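To make the three-hop claim concrete, here is a minimal sketch of how the scale of such a network can be estimated from the switch radix alone, assuming the standard balanced dragonfly construction commonly used with high-radix switches. The radix-64 figure and the balanced port split are illustrative assumptions, not values taken from the abstract above.

```python
# Sketch: scaling of a balanced dragonfly built from high-radix switches.
# Assumptions (illustrative, not from the paper): a radix-64 switch and the
# standard balanced split k = p + a + h with a = 2p = 2h, where
#   p = terminal (endpoint) ports per switch,
#   a = switches per group (all-to-all local links within a group),
#   h = global ports per switch (inter-group links).

def dragonfly_scale(radix: int) -> dict:
    p = radix // 4          # terminal ports per switch
    h = radix // 4          # global ports per switch
    a = radix // 2          # switches per group
    groups = a * h + 1      # one global link between every pair of groups
    endpoints = p * a * groups
    return {"terminal_ports": p, "global_ports": h,
            "switches_per_group": a, "groups": groups,
            "endpoints": endpoints}

if __name__ == "__main__":
    # For radix 64: p = 16, a = 32, h = 16 -> 513 groups, 262,656 endpoints.
    print(dragonfly_scale(64))
```

Under this construction, any minimal route crosses at most one local hop in the source group, one global hop, and one local hop in the destination group, which is where the three switch-to-switch hop bound comes from, independently of the total endpoint count computed above.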
