Accelerating Allreduce Operation: A Switch-Based Solution

Collective operations, such as allreduce, are widely regarded as critical limiting factors for achieving high performance in massively parallel applications. Conventional host-based implementations, which decompose a collective into a large number of point-to-point messages, scale poorly on large systems. To address this issue, we propose a switch chip design that accelerates collective operations, with a focus on allreduce. The main advantage of the proposed solution is its high scalability, since expensive point-to-point communication is avoided. Two allreduce schemes, block-allreduce and burst-allreduce, are implemented for short and long messages, respectively. We evaluate the proposed design with both a cycle-accurate simulator and an FPGA prototype system. The experimental results show that the switch-based allreduce implementation is efficient and scalable, especially in large-scale systems. On the prototype, the switch-based implementation significantly outperforms the host-based one, achieving a 16x improvement in MPI time on 16 nodes. Furthermore, simulation shows that, as the system scales from 2 to 4096 nodes, the switch-based allreduce latency increases only slightly, by less than 2 µs.
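For context, the sketch below is a minimal example (not the authors' implementation) of the allreduce operation being accelerated, expressed with the standard MPI API. Host-based MPI libraries realize this call through trees or recursive doubling, i.e. many point-to-point messages; the proposed switch would instead perform the reduction inside the network, selecting a block- or burst-style scheme by message length. The short/long-message threshold mentioned in the comments is an assumption for illustration only.

```c
/* Minimal sketch of the allreduce operation targeted by the switch.
 * Uses only standard MPI; the switch-based offload and its
 * block/burst message-size threshold are described in the paper
 * and are not implemented here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes a local partial value. */
    double local = (double)rank;
    double global = 0.0;

    /* A host-based library expands this single call into many
     * point-to-point exchanges.  A switch-based implementation
     * reduces the operands in-network, using block-allreduce for
     * short messages and burst-allreduce for long ones
     * (the exact cutover point is an assumption, not given here). */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```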
