Design and Analysis of Pipelined Broadcast Algorithms for the All-Port Interlaced Bypass Torus Networks

Broadcast algorithms for interlaced bypass torus (iBT) networks are introduced to balance all-port bandwidth efficiency and to avoid congestion in multidimensional cases. With these algorithms, we numerically analyze how broadcast efficiency depends on packet-sending patterns, bypass schemes, network sizes, and dimensionality, and we then strategically tune the configurations to minimize the number of broadcast steps. Leveraging this analysis, we compare the performance of two million-node networks: one augmented with fixed-length bypass links and the other with an added torus dimension. A case study of iBT(1000^2; b = (8, 32)) versus Torus(100^3) shows that the former improves the diameter, the average node-to-node distance, and the rectangular and global broadcasts over the latter by approximately 80 percent. This reaffirms that strategically interlacing short bypass links, and methodically utilizing those links, is superior to adding torus dimensions for achieving a shorter diameter, shorter average node-to-node distances, and faster broadcasts.
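The central claim, that interlacing short fixed-length bypass links shrinks the diameter far more effectively than plain torus links alone, can be illustrated with a toy breadth-first-search computation. The sketch below is a simplified 1-D stand-in for the paper's 2-D iBT construction: it assumes bypass links of length b are attached only at every b-th node of a ring (the actual interlacing scheme in the paper is more elaborate), and it measures the resulting diameter exactly by BFS.

```python
from collections import deque

def neighbors(n, node, bypass=None):
    """Neighbors of `node` in an n-node ring, optionally with bypass links.

    Assumption for illustration: a bypass link of length `bypass` is
    attached at every `bypass`-th node, a simplified stand-in for the
    paper's interlaced bypass scheme.
    """
    nbrs = [(node - 1) % n, (node + 1) % n]  # ordinary torus (ring) links
    if bypass and node % bypass == 0:
        nbrs += [(node - bypass) % n, (node + bypass) % n]
    return nbrs

def diameter(n, bypass=None):
    """Exact diameter via BFS from every source node."""
    worst = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(n, u, bypass):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(diameter(64))            # plain 64-node ring: diameter is N // 2 = 32
print(diameter(64, bypass=8))  # bypass links cut the diameter substantially
```

Even this crude scheme cuts the 64-node ring's diameter by well over half, which is the qualitative effect the paper quantifies for the full 2-D iBT topology and its broadcast algorithms.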
