Accelerating Allreduce Operation: A Switch-Based Solution

Collective operations, such as allreduce, are widely regarded as critical limiting factors for achieving high performance in massively parallel applications. Conventional host-based implementations, which decompose a collective into a large number of point-to-point messages, scale poorly on large systems. To address this issue, we propose a switch chip design that accelerates collective operations, with a focus on allreduce. The main advantage of the proposed solution is its high scalability, since expensive point-to-point communication is avoided. Two allreduce schemes, block-allreduce and burst-allreduce, are implemented for short and long messages, respectively. We evaluate the proposed design with both a cycle-accurate simulator and an FPGA prototype system. The experimental results show that the switch-based allreduce implementation is efficient and scalable, especially in large-scale systems. On the prototype, the switch-based implementation significantly outperforms the host-based one, achieving a 16x improvement in MPI time on 16 nodes. Furthermore, simulation shows that, as the system scales from 2 to 4096 nodes, the switch-based allreduce latency increases only slightly, by less than 2 µs.
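For context, the sketch below is a minimal example (not the authors' implementation) of the allreduce operation being accelerated, expressed with the standard MPI API. Host-based MPI libraries realize this call through trees or recursive doubling, i.e. many point-to-point messages; the proposed switch would instead perform the reduction inside the network, selecting a block- or burst-style scheme by message length. The short/long-message threshold mentioned in the comments is an assumption for illustration only.

```c
/* Minimal sketch of the allreduce operation targeted by the switch.
 * Uses only standard MPI; the switch-based offload and its
 * block/burst message-size threshold are described in the paper
 * and are not implemented here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes a local partial value. */
    double local = (double)rank;
    double global = 0.0;

    /* A host-based library expands this single call into many
     * point-to-point exchanges.  A switch-based implementation
     * reduces the operands in-network, using block-allreduce for
     * short messages and burst-allreduce for long ones
     * (the exact cutover point is an assumption, not given here). */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```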
