Collective algorithms for sub-communicators

Collective communication over a group of processors is an integral and time-consuming component of many high performance computing applications. Many modern supercomputers are based on torus interconnects, and near-optimal algorithms have been developed for collective communication over regular communicators on these systems. However, for an irregular communicator comprising a subset of processors, the algorithms developed so far are not contention-free in general and hence suboptimal. In this paper, we present a novel contention-free algorithm to perform collective operations over a subset of processors in a torus network. We also extend previous work on regular communicators to handle special cases of irregular communicators that occur frequently in parallel scientific applications. For the generic case where multiple node-disjoint sub-communicators communicate simultaneously in a loosely synchronous fashion, we propose a novel cooperative approach to route the data for individual sub-communicators without contention. Empirical results demonstrate that our algorithms outperform the optimized MPI collective implementation on IBM's Blue Gene/P supercomputer for large data sizes and random node distributions.
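To make the setting concrete, the sketch below shows how an application typically forms an irregular sub-communicator from an arbitrary subset of ranks using standard MPI and then runs a collective over it. This is only illustrative context, not the paper's routing algorithm: the choice of subset (even-numbered ranks) and all variable names are assumptions, and the contention-free routing described in the paper would happen inside the MPI library beneath a call like the one shown.

```c
/* Minimal sketch, assuming standard MPI only (not the paper's algorithm):
 * form an irregular sub-communicator from a subset of ranks and run a
 * collective restricted to that subset. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the world communicator: ranks with color 0 join the
     * sub-communicator; the rest pass MPI_UNDEFINED and receive
     * MPI_COMM_NULL.  The even-rank subset here is purely illustrative. */
    int color = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);

    if (sub != MPI_COMM_NULL) {
        /* Collective over the subset: the MPI implementation chooses the
         * routes; the paper's contribution is making such routes
         * contention-free for irregular subsets on a torus. */
        double local = (double)rank, sum = 0.0;
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, sub);
        printf("rank %d: sum over sub-communicator = %f\n", rank, sum);
        MPI_Comm_free(&sub);
    }

    MPI_Finalize();
    return 0;
}
```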
