The design and implementation of MPI collective operations for clusters in long-and-fast networks

Abstract Several MPI systems for Grid environments, in which clusters are connected by wide-area networks, have been proposed. However, the collective communication algorithms in such MPI systems assume relatively low-bandwidth wide-area networks, and they are not designed for the fast wide-area networks that are becoming available. On the other hand, for cluster MPI systems, a bcast algorithm by van de Geijn et al. and an allreduce algorithm by Rabenseifner have been proposed, both of which are efficient in a high-bisection-bandwidth environment. We modify those algorithms to utilize fast wide-area inter-cluster networks effectively and to control the number of nodes that may transfer data simultaneously over the wide-area network, so as to avoid congestion. We confirmed the effectiveness of the modified algorithms by experiments in an emulated 10 Gbps WAN environment. The environment consists of two clusters, each made up of nodes with 1 Gbps Ethernet links and a switch with a 10 Gbps uplink; the two clusters are connected through a 10 Gbps WAN emulator that can insert latency. With 10 ms of inserted latency and a 32 MB message size, the proposed bcast and allreduce are 1.6 and 3.2 times faster, respectively, than the algorithms used in existing MPI systems for Grid environments.
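For reference, the baseline cluster algorithms the abstract builds on have well-known structures: van de Geijn's long-message bcast is a scatter followed by an allgather, and Rabenseifner's allreduce is a reduce-scatter followed by an allgather. The sketch below illustrates only these baselines in MPI C; it is not the paper's implementation, the WAN-aware modifications (limiting how many nodes send over the inter-cluster link at once) are not reproduced, the chunk size is assumed to divide evenly, and the modern MPI_Reduce_scatter_block call is used for brevity.

#include <mpi.h>

/* van de Geijn-style bcast of `count` bytes from `root`:
 * Phase 1 scatters one chunk per rank, Phase 2 allgathers the chunks. */
static void bcast_scatter_allgather(char *buf, int count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int chunk = count / size;    /* assumes size divides count evenly */

    /* Phase 1: root keeps its own chunk in place; every other rank
     * receives its chunk at offset rank*chunk of buf.                */
    if (rank == root)
        MPI_Scatter(buf, chunk, MPI_BYTE, MPI_IN_PLACE, chunk, MPI_BYTE,
                    root, comm);
    else
        MPI_Scatter(NULL, chunk, MPI_BYTE, buf + rank * chunk, chunk, MPI_BYTE,
                    root, comm);

    /* Phase 2: allgather so every rank ends up with the full message. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  buf, chunk, MPI_BYTE, comm);
}

/* Rabenseifner-style allreduce (sum) over `count` doubles:
 * Phase 1 reduce-scatter, Phase 2 allgather of the reduced slices.
 * recvbuf must hold `count` elements on every rank.                  */
static void allreduce_rsag(const double *sendbuf, double *recvbuf,
                           int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int chunk = count / size;    /* assumes size divides count evenly */

    /* Phase 1: each rank obtains the fully reduced values for its own
     * 1/size slice of the vector.                                      */
    MPI_Reduce_scatter_block(sendbuf, recvbuf + rank * chunk,
                             chunk, MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 2: allgather the reduced slices so every rank holds the
     * complete reduced vector.                                         */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  recvbuf, chunk, MPI_DOUBLE, comm);
}

Both phases are bandwidth-efficient because every rank sends and receives only about 2(count/size)*(size-1) elements; the paper's contribution is to adapt the inter-cluster portion of these phases to a long-latency, high-bandwidth WAN.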

[1] Ryousei Takano, et al. Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks, 2005.

[2] Robert A. van de Geijn, et al. Building a high-performance collective communication library, 1994, Proceedings of Supercomputing '94.

[3] Rolf Rabenseifner, et al. Optimization of Collective Reduction Operations, 2004, International Conference on Computational Science.

[4] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.

[5] Yuetsu Kodama, et al. TCP Adaptation for MPI on Long-and-Fat Networks, 2005, 2005 IEEE International Conference on Cluster Computing.

[6] Rolf Rabenseifner, et al. Automatic MPI Counter Profiling of All Users: First Results on a CRAY T3E 900-512, 2004.

[7] Motohiko Matsuda, et al. Evaluation of MPI implementations on grid-connected clusters using an emulated WAN environment, 2003, Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003).

[8] Michael M. Resch, et al. Implementing MPI with Optimized Algorithms for Metacomputing, 1999.

[9] Yuetsu Kodama, et al. GNET-1: gigabit Ethernet network testbed, 2004, 2004 IEEE International Conference on Cluster Computing.

[10] Henri E. Bal, et al. Bandwidth-efficient collective communication for clustered wide area systems, 2000, Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000).

[11] Bronis R. de Supinski, et al. A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids, 2002, ArXiv.

[12] J. Watts, et al. Interprocessor collective communication library (InterCom), 1994, Proceedings of the IEEE Scalable High Performance Computing Conference.

[13] Robert A. van de Geijn, et al. Collective communication on architectures that support simultaneous communication over multiple links, 2006, PPoPP '06.

[14] Mark Allman, et al. An Application-Level Solution to TCP's Satellite Inefficiencies, 1996.