Multistage switches are not crossbars: Effects of static routing in high-performance networks

Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically use such networks without considering the internal characteristics of the switch. However, application performance on these networks, particularly with respect to bisection bandwidth, does depend on the communication paths through the switch. In this paper we discuss the limitations of the hardware (capacity-based) definition of bisection bandwidth and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.
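To make the idea of effective bisection bandwidth concrete, the following is a minimal toy sketch and not the simulator used in the paper. It assumes a hypothetical two-level fat tree (leaf and spine switches with unit-capacity links), a static destination-based uplink choice (`dst % num_spines`), and random bisection patterns in which one half of the hosts each send a single flow to a distinct partner in the other half. Every name and parameter in it is an illustrative assumption.

```python
import random
from collections import Counter


def route(src, dst, hosts_per_leaf, num_spines):
    """Return the directed inter-switch links a flow from src to dst uses."""
    src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
    if src_leaf == dst_leaf:
        return []  # traffic stays inside one leaf switch (treated as a crossbar)
    spine = dst % num_spines  # assumed static, destination-based routing rule
    return [("up", src_leaf, spine), ("down", spine, dst_leaf)]


def effective_bisection_bandwidth(num_leaves, hosts_per_leaf, num_spines,
                                  trials=200, seed=0):
    """Average per-flow bandwidth share over random bisection patterns.

    Each flow is assumed to receive 1 / (load of its most congested link),
    where load is the number of flows sharing that link.
    """
    rng = random.Random(seed)
    hosts = list(range(num_leaves * hosts_per_leaf))
    half = len(hosts) // 2
    total = 0.0
    for _ in range(trials):
        rng.shuffle(hosts)
        # First half sends, second half receives: one flow per pair.
        pairs = list(zip(hosts[:half], hosts[half:]))
        paths = [route(s, d, hosts_per_leaf, num_spines) for s, d in pairs]
        load = Counter(link for path in paths for link in path)
        shares = [1.0 / max((load[link] for link in path), default=1)
                  for path in paths]
        total += sum(shares) / len(shares)
    return total / trials


if __name__ == "__main__":
    # Hypothetical 64-host network: 8 leaves, 8 hosts and 8 uplinks per leaf.
    print(effective_bisection_bandwidth(8, 8, 8))
```

In this sketch the topology has full bisection bandwidth by capacity, yet flows from the same leaf can be statically mapped onto the same uplink, so the averaged share falls below 1. The specific value depends entirely on the assumed topology and routing rule and should not be read as reproducing the paper's measurements.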
