Topology-oblivious optimization of MPI broadcast algorithms on extreme-scale platforms

Significant research has been conducted on collective communication operations, in particular MPI broadcast, for distributed-memory platforms. Most of these efforts aim to optimize collective operations for particular architectures by taking into account either their topology or platform parameters. In this work we propose a simple but general approach to optimizing the legacy MPI broadcast algorithms that are widely used in MPICH and Open MPI. The proposed optimization technique is designed to address the challenge of the extreme scale of future HPC platforms. It is based on a hierarchical transformation of the traditionally flat logical arrangement of communicating processes. Theoretical analysis and experimental results on IBM BlueGene/P and a cluster of the Grid’5000 platform are presented.
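The hierarchical transformation described above can be illustrated with a small simulation sketch (a hypothetical illustration, not the authors' implementation): a flat arrangement of P processes is split into groups of size G; the root first sends the message to one leader per group, and each leader then broadcasts within its own group. The function name and the choice of a linear (flat-tree) algorithm at both levels are assumptions made for clarity.

```python
def hierarchical_bcast_schedule(p, g, root=0):
    """Simulate a two-level broadcast over p processes arranged in groups of g.

    Returns the ordered list of point-to-point sends (src, dst):
    first the inter-group phase (root -> group leaders), then the
    intra-group phase (each leader -> its group members).
    Assumes root == 0 and a linear flat-tree algorithm at both levels.
    """
    leaders = list(range(0, p, g))  # the lowest rank of each group acts as leader
    # Inter-group phase: root sends to every other group leader.
    sends = [(root, l) for l in leaders if l != root]
    # Intra-group phase: each leader sends to the remaining ranks in its group.
    for l in leaders:
        sends += [(l, r) for r in range(l + 1, min(l + g, p))]
    return sends

# Every rank except the root receives the message exactly once.
sched = hierarchical_bcast_schedule(16, 4)
assert {dst for _, dst in sched} == set(range(1, 16))
```

With a linear algorithm, the flat arrangement requires P - 1 sequential sends at the root, whereas in the two-level arrangement no single process performs more than (P/G - 1) + (G - 1) sends, which is minimized when G is close to the square root of P; this is the standard motivation for hierarchical decomposition of flat broadcast trees.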
