Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

The all-to-all collective communication operation is used by many scientific applications and is one of the most time-consuming and challenging collective operations to optimize. Algorithms for the all-to-all operation typically fall into two classes, logarithmic-scaling and linear-scaling algorithms, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation, as well as for supporting truly asynchronous implementations of it. This paper presents several new enhancements to Bruck's small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation: the RDMA, SR-RNR, and SR-RTR algorithms. Nonblocking implementations of these operations are also presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs best. For a 64-process all-to-all with 128 bytes per process, the RDMA algorithm performs 27% better than the Bruck's algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. The nonblocking versions of these algorithms have the same performance characteristics as their blocking counterparts. Finally, measurements of computation/communication overlap show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations relying on host-based progress achieve only about 9.5% overlap.
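For context, a minimal sketch of the generic Bruck small-data all-to-all that the paper's enhancements start from is shown below: a local rotation, ⌈log₂ p⌉ exchange rounds, and an inverse rotation. This is a host-based formulation, not the paper's CORE-Direct implementation; the function name bruck_alltoall and the use of MPI_Sendrecv_replace on a packed staging buffer are illustrative choices.

```c
/* A minimal, host-based sketch of Bruck's log-scaling all-to-all for
 * small messages, assuming p processes each contribute `bytes` bytes
 * per peer. Illustrative only, not the paper's offloaded version. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void bruck_alltoall(const char *sendbuf, char *recvbuf,
                           int bytes, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *work = malloc((size_t)p * bytes);  /* working buffer      */
    char *tmp  = malloc((size_t)p * bytes);  /* pack/unpack staging */

    /* Phase 1: local rotation so the block destined for rank
     * (rank + i) % p sits at position i. */
    for (int i = 0; i < p; i++)
        memcpy(work + (size_t)i * bytes,
               sendbuf + (size_t)((rank + i) % p) * bytes, bytes);

    /* Phase 2: ceil(log2(p)) rounds; in round k, forward every block
     * whose index has bit k set to rank + k, and receive the matching
     * blocks from rank - k. A block at index i thus travels a total
     * distance of i and arrives at its destination. */
    for (int k = 1; k < p; k <<= 1) {
        int to   = (rank + k) % p;
        int from = (rank - k + p) % p;
        int n = 0;
        for (int i = 0; i < p; i++)          /* pack blocks with bit k */
            if (i & k)
                memcpy(tmp + (size_t)n++ * bytes,
                       work + (size_t)i * bytes, bytes);
        MPI_Sendrecv_replace(tmp, n * bytes, MPI_BYTE, to, 0,
                             from, 0, comm, MPI_STATUS_IGNORE);
        n = 0;
        for (int i = 0; i < p; i++)          /* unpack into same slots */
            if (i & k)
                memcpy(work + (size_t)i * bytes,
                       tmp + (size_t)n++ * bytes, bytes);
    }

    /* Phase 3: inverse rotation; after the rounds, index i holds the
     * block that originated at rank (rank - i + p) % p. */
    for (int i = 0; i < p; i++)
        memcpy(recvbuf + (size_t)((rank - i + p) % p) * bytes,
               work + (size_t)i * bytes, bytes);

    free(work);
    free(tmp);
}
```

As written this behaves like MPI_Alltoall over uniform MPI_BYTE blocks; the packing loops in phase 2 are exactly the non-contiguous gather work that the paper's RDMA algorithm pushes into the HCA hardware.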
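The overlap figures quoted above are typically obtained by timing a nonblocking all-to-all once with an immediate wait and once with computation interposed between the post and the wait. The sketch below illustrates that measurement pattern under stated assumptions: the paper's nonblocking interface predates the standard MPI_Ialltoall used here, and the message size and busy-wait kernel are placeholders.

```c
/* A minimal sketch of a COMB-style overlap measurement. Assumes an
 * MPI-3 library providing MPI_Ialltoall; sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Synthetic compute kernel: busy-loops for roughly `seconds` without
 * calling into MPI, so only HCA-driven progress can advance the
 * collective (host-progressed implementations stall here). */
static void compute(double seconds)
{
    double end = MPI_Wtime() + seconds;
    while (MPI_Wtime() < end)
        ;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int count = 128;                 /* bytes per peer (assumed) */
    char *sbuf = malloc((size_t)p * count);
    char *rbuf = malloc((size_t)p * count);
    memset(sbuf, rank & 0xff, (size_t)p * count);
    MPI_Request req;

    /* Baseline: nonblocking all-to-all waited on immediately. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Ialltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE,
                  MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_comm = MPI_Wtime() - t0;

    /* Overlapped: same operation with computation in between. With
     * offloaded progress, total time approaches
     * max(t_comm, t_compute) rather than their sum. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Ialltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE,
                  MPI_COMM_WORLD, &req);
    compute(t_comm);                       /* hide comm behind compute */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_total = MPI_Wtime() - t0;

    /* Fraction of communication hidden: 1.0 = full overlap. */
    double overlap = 1.0 - (t_total - t_comm) / t_comm;
    if (rank == 0)
        printf("overlap: %.1f%%\n", 100.0 * overlap);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```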
