Efficient and scalable all-to-all personalized exchange for InfiniBand-based clusters

The all-to-all personalized exchange is the densest collective communication function offered by the MPI specification. In this operation, every process sends a different message to every other participating process. The operation is essential for many parallel scientific applications. As system and message sizes increase, it becomes challenging to offer a fast, scalable and efficient implementation of it. InfiniBand is an emerging modern interconnect. It offers very low latency, high bandwidth and one-sided operations such as RDMA write. Advanced features such as RDMA write gather allow us to design and implement all-to-all algorithms much more efficiently than in the past. Our aim in this work is to design efficient and scalable implementations of traditional personalized exchange algorithms. We present two novel approaches to designing all-to-all algorithms, one for short messages and one for long messages. The hypercube RDMA write gather and direct eager schemes effectively leverage the RDMA and RDMA write gather mechanisms offered by InfiniBand. Performance evaluation of our design and implementation shows that it reduces all-to-all communication time by up to a factor of 3.07 for 32-byte messages on a 16-node InfiniBand cluster. Our analytical models suggest that the proposed designs perform 64% better on InfiniBand clusters with 1024 nodes for a 4k message size.
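To make the operation concrete, the following is a minimal sketch of the data movement in the classic hypercube (store-and-forward) personalized exchange, simulated in a single Python process. It is not the RDMA-based implementation described above: there is no network, no MPI and no RDMA here, and the function name `hypercube_alltoall` and the `(source, destination, payload)` triple layout are assumptions made only for illustration. The sketch assumes the number of processes is a power of two.

```python
def hypercube_alltoall(send):
    """Simulate a hypercube all-to-all personalized exchange.

    send[i][j] is the message process i wants to deliver to process j.
    Returns recv, where recv[j][i] is the message process j got from i.
    Hypothetical helper for illustration; assumes len(send) is a power of two.
    """
    p = len(send)
    assert p > 0 and p & (p - 1) == 0, "process count must be a power of two"

    # buffers[i] holds the (source, destination, payload) triples
    # that currently reside at process i.
    buffers = [[(i, j, send[i][j]) for j in range(p)] for i in range(p)]

    steps = p.bit_length() - 1  # log2(p) exchange steps
    for k in range(steps):
        bit = 1 << k
        next_buffers = [[] for _ in range(p)]
        for i in range(p):
            partner = i ^ bit  # neighbor along dimension k
            for msg in buffers[i]:
                _, dst, _ = msg
                if (dst ^ i) & bit:
                    # Destination differs from i in bit k: forward to partner.
                    next_buffers[partner].append(msg)
                else:
                    # Destination agrees with i in bit k: keep locally.
                    next_buffers[i].append(msg)
        buffers = next_buffers

    # After log2(p) steps every triple sits at its destination.
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        for src, dst, payload in buffers[i]:
            assert dst == i
            recv[i][src] = payload
    return recv


if __name__ == "__main__":
    p = 8
    send = [[f"{i}->{j}" for j in range(p)] for i in range(p)]
    recv = hypercube_alltoall(send)
    print(recv[3])
```

In this simulation each process forwards half of the blocks it holds in every step, and the blocks chosen in a given step are generally not adjacent in the original send buffer. A real implementation would have to either pack those blocks into a contiguous buffer or send them from scattered locations.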
