Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

The all-to-all collective communication operation is used by many scientific applications and is one of the most time-consuming and challenging collective operations to optimize. Algorithms for the all-to-all operation typically fall into two classes, logarithmic-scaling and linear-scaling algorithms, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation, as well as for supporting truly asynchronous implementations of it. This paper presents several new enhancements to Bruck's small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation: the RDMA, SR-RNR, and SR-RTR algorithms. Nonblocking implementations of these operations are also presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs best. For a 64-process all-to-all with 128 bytes per process, the RDMA algorithm performs 27% better than the Bruck's algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. The nonblocking versions of these algorithms have the same performance characteristics as their blocking counterparts. Finally, measurements of computation/communication overlap show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations relying on host-based progress achieve only about 9.5% overlap.
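For context, a minimal sketch of the generic Bruck small-data all-to-all that the paper's enhancements start from is shown below: a local rotation, ⌈log₂ p⌉ exchange rounds, and an inverse rotation. This is a host-based formulation, not the paper's CORE-Direct implementation; the function name bruck_alltoall and the use of MPI_Sendrecv_replace on a packed staging buffer are illustrative choices.

```c
/* A minimal, host-based sketch of Bruck's log-scaling all-to-all for
 * small messages, assuming p processes each contribute `bytes` bytes
 * per peer. Illustrative only, not the paper's offloaded version. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void bruck_alltoall(const char *sendbuf, char *recvbuf,
                           int bytes, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *work = malloc((size_t)p * bytes);  /* working buffer      */
    char *tmp  = malloc((size_t)p * bytes);  /* pack/unpack staging */

    /* Phase 1: local rotation so the block destined for rank
     * (rank + i) % p sits at position i. */
    for (int i = 0; i < p; i++)
        memcpy(work + (size_t)i * bytes,
               sendbuf + (size_t)((rank + i) % p) * bytes, bytes);

    /* Phase 2: ceil(log2(p)) rounds; in round k, forward every block
     * whose index has bit k set to rank + k, and receive the matching
     * blocks from rank - k. A block at index i thus travels a total
     * distance of i and arrives at its destination. */
    for (int k = 1; k < p; k <<= 1) {
        int to   = (rank + k) % p;
        int from = (rank - k + p) % p;
        int n = 0;
        for (int i = 0; i < p; i++)          /* pack blocks with bit k */
            if (i & k)
                memcpy(tmp + (size_t)n++ * bytes,
                       work + (size_t)i * bytes, bytes);
        MPI_Sendrecv_replace(tmp, n * bytes, MPI_BYTE, to, 0,
                             from, 0, comm, MPI_STATUS_IGNORE);
        n = 0;
        for (int i = 0; i < p; i++)          /* unpack into same slots */
            if (i & k)
                memcpy(work + (size_t)i * bytes,
                       tmp + (size_t)n++ * bytes, bytes);
    }

    /* Phase 3: inverse rotation; after the rounds, index i holds the
     * block that originated at rank (rank - i + p) % p. */
    for (int i = 0; i < p; i++)
        memcpy(recvbuf + (size_t)((rank - i + p) % p) * bytes,
               work + (size_t)i * bytes, bytes);

    free(work);
    free(tmp);
}
```

As written this behaves like MPI_Alltoall over uniform MPI_BYTE blocks; the packing loops in phase 2 are exactly the non-contiguous gather work that the paper's RDMA algorithm pushes into the HCA hardware.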
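The overlap figures quoted above are typically obtained by timing a nonblocking all-to-all once with an immediate wait and once with computation interposed between the post and the wait. The sketch below illustrates that measurement pattern under stated assumptions: the paper's nonblocking interface predates the standard MPI_Ialltoall used here, and the message size and busy-wait kernel are placeholders.

```c
/* A minimal sketch of a COMB-style overlap measurement. Assumes an
 * MPI-3 library providing MPI_Ialltoall; sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Synthetic compute kernel: busy-loops for roughly `seconds` without
 * calling into MPI, so only HCA-driven progress can advance the
 * collective (host-progressed implementations stall here). */
static void compute(double seconds)
{
    double end = MPI_Wtime() + seconds;
    while (MPI_Wtime() < end)
        ;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int count = 128;                 /* bytes per peer (assumed) */
    char *sbuf = malloc((size_t)p * count);
    char *rbuf = malloc((size_t)p * count);
    memset(sbuf, rank & 0xff, (size_t)p * count);
    MPI_Request req;

    /* Baseline: nonblocking all-to-all waited on immediately. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Ialltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE,
                  MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_comm = MPI_Wtime() - t0;

    /* Overlapped: same operation with computation in between. With
     * offloaded progress, total time approaches
     * max(t_comm, t_compute) rather than their sum. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Ialltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE,
                  MPI_COMM_WORLD, &req);
    compute(t_comm);                       /* hide comm behind compute */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_total = MPI_Wtime() - t0;

    /* Fraction of communication hidden: 1.0 = full overlap. */
    double overlap = 1.0 - (t_total - t_comm) / t_comm;
    if (rank == 0)
        printf("overlap: %.1f%%\n", 100.0 * overlap);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```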
