Optimal low-latency network topologies for cluster performance enhancement

We propose that interconnecting a cluster with a network topology of minimal mean path length increases its processing speed. We test this heuristic by constructing clusters of up to 32 nodes with torus, ring, Chvátal, Wagner, Bidiakis, and optimal (minimal mean path length) topologies, and by simulating the performance of 256-node clusters with the same network topologies. The optimal (or near-optimal) low-latency network topologies are found by minimizing the mean path length of regular graphs. The selected topologies are benchmarked using ping-pong messaging, MPI collective communications, and standard parallel applications including effective bandwidth, FFTE, Graph 500, and the NAS parallel benchmarks. We establish strong correlations between cluster performance and network topology, especially the mean path length, for a wide range of applications. In communication-intensive benchmarks, network topologies based on optimal graphs delivered multifold performance gains over mainstream graphs. It is striking that merely adjusting the network topology suffices to reclaim performance from the same computing hardware.
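The metric at the center of this work, the mean path length, is the average hop count over all pairs of distinct nodes. The following is a minimal sketch of how it can be computed with breadth-first search, not the authors' implementation. The function names `mean_path_length`, `ring`, and `wagner` are illustrative assumptions, and the Wagner graph is built here in its standard form as an 8-node ring with chords joining opposite nodes.

```python
from collections import deque


def mean_path_length(adj):
    """Average shortest-path hop count over all ordered pairs of distinct nodes.

    `adj` is an adjacency list: adj[u] lists the neighbors of node u.
    """
    n = len(adj)
    total = 0
    for src in range(n):
        # Breadth-first search gives hop distances in an unweighted graph.
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            raise ValueError("graph is disconnected")
        total += sum(dist)
    return total / (n * (n - 1))


def ring(n):
    """Ring of n nodes: each node links to its two cyclic neighbors."""
    return [sorted({(i - 1) % n, (i + 1) % n}) for i in range(n)]


def wagner():
    """Wagner graph: an 8-node ring plus chords between opposite nodes."""
    n = 8
    return [sorted({(i - 1) % n, (i + 1) % n, (i + 4) % n}) for i in range(n)]


if __name__ == "__main__":
    print("ring(8):", mean_path_length(ring(8)))    # 16/7, about 2.29 hops
    print("wagner :", mean_path_length(wagner()))   # 11/7, about 1.57 hops
```

With the same eight nodes, adding one chord per node lowers the mean path length from 16/7 to 11/7 hops. This is the kind of purely topological gain the heuristic aims to exploit, although the sketch says nothing about measured cluster performance.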
