Runtime Optimization of Broadcast Communications Using Dynamic Network Topology Information from MPI

Modern commodity compute clusters are often composed of many multi-core nodes connected to each other via a network. On multi-core clusters, inter-node network communications are typically an order of magnitude slower than communications between processes on the same node, which effectively creates a heterogeneous, tiered network topology. Presently, most MPI implementations assume a homogeneous network, which leads to suboptimal performance on multi-core clusters. In this paper, we treat a multi-core cluster as a heterogeneous cluster and optimize the performance of MPI broadcast communications by scheduling messages according to topology information. We experimentally demonstrate that previous heuristics for heterogeneous clusters, such as Fastest Edge First (FEF), do not produce optimal results for broadcast communications on multi-core clusters. Our solution is to modify the Fastest Edge First heuristic by imposing an additional constraint that permits only one core per node to participate in inter-node communications, creating a nested binomial tree structure. Using this constraint, we are able to achieve performance gains of 20%-60% over the MPI broadcast implementation on homogeneous, multi-core clusters.
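The one-core-per-node constraint and the resulting nested binomial tree can be illustrated with a minimal sketch. This is not the paper's modified FEF implementation: it ignores edge weights entirely, assumes uniform inter-node costs, and completes the inter-node phase before the intra-node phase begins. The names `binomial_rounds`, `nested_broadcast_schedule`, and `node_of_rank` are hypothetical and chosen only for illustration.

```python
def binomial_rounds(members):
    """Return binomial-tree broadcast rounds over members; members[0] is the root.

    Each round is a list of (sender, receiver) pairs that can proceed in parallel.
    """
    rounds = []
    have = 1  # number of members that already hold the message
    while have < len(members):
        # Every current holder i sends to the member at index i + have, if it exists.
        sends = [(members[i], members[i + have])
                 for i in range(have) if i + have < len(members)]
        rounds.append(sends)
        have *= 2
    return rounds


def nested_broadcast_schedule(node_of_rank, root=0):
    """Build a two-level schedule (assumed simplification, not the paper's code).

    node_of_rank[r] is the node hosting rank r. Only one rank per node (its
    "leader") takes part in inter-node sends; all other ranks receive from a
    rank on their own node.
    """
    # Group ranks by node, preserving rank order.
    groups = {}
    for rank, node in enumerate(node_of_rank):
        groups.setdefault(node, []).append(rank)

    # The broadcast root leads its own node, and that node comes first.
    root_node = node_of_rank[root]
    groups[root_node].remove(root)
    groups[root_node].insert(0, root)
    node_order = [root_node] + [n for n in groups if n != root_node]

    # Outer binomial tree: leaders only, so one core per node uses the network.
    leaders = [groups[n][0] for n in node_order]
    inter = binomial_rounds(leaders)

    # Inner binomial trees: one per node, merged round by round.
    intra = []
    for n in node_order:
        for k, sends in enumerate(binomial_rounds(groups[n])):
            if k == len(intra):
                intra.append([])
            intra[k].extend(sends)
    return inter, intra


if __name__ == "__main__":
    # Hypothetical layout: 3 nodes with 4 cores each, ranks placed node by node.
    layout = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    inter, intra = nested_broadcast_schedule(layout, root=0)
    print("inter-node rounds:", inter)
    print("intra-node rounds:", intra)
```

In this sketch the inter-node rounds involve only ranks 0, 4, and 8, so each node's network link is used by a single core, and the remaining ranks are reached over the faster intra-node path. A weighted heuristic such as FEF would instead choose the order of inter-node sends from measured link costs.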
