Hierarchical gather/scatter algorithms with graceful degradation

We present and implement simple binomial-tree based algorithms for the gather and scatter operations of MPI (the Message Passing Interface). For small data sets, data are gathered (scattered) in a tree-like fashion. As the data size increases, the algorithms gracefully degrade toward the serial algorithm, in which the root process gathers (scatters) data from (to) one process after another. We extend these algorithms to the more difficult irregular gather/scatter operations, in which the processes send/receive different amounts of data, and further adapt them to the hierarchical communication structure of SMP clusters. We compare the new algorithms to the straightforward serial implementations of the gather/scatter primitives, and demonstrate substantial improvements both on a 32-node, 2-way SMP cluster and on a 4-node NEC SX-6 vector supercomputer with 8 processors per node. For the regular gather/scatter operations, improvements by a factor of 3 to 7 are achieved for critical data sizes on the SMP system, and by a factor of 3 to 4 on the SX-6. On 256 nodes of the Earth Simulator, the improvement for scattering small data is more than a factor of 60. Comparable improvements are achieved for the irregular operations, despite the preprocessing and communication overhead of dynamic tree construction. We discuss issues in modeling and analyzing the performance of the algorithms, for the irregular collectives in particular.
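To make the idea concrete, here is a minimal sketch in MPI C of a gather that uses a binomial tree for small data and falls back to the serial algorithm for large data. It is not the paper's implementation: the hard switch at `SMALL_LIMIT`, the assumptions that the root is rank 0 and that the communicator size is a power of two, and all identifiers are ours for illustration, and the paper's algorithms degrade gradually rather than switching at a single threshold.

```c
/* Sketch of a binomial-tree gather with a serial fallback.
 * Assumptions (not from the paper): root is rank 0, the communicator
 * size is a power of two, and SMALL_LIMIT is a hypothetical per-process
 * message size below which the tree algorithm is used. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_LIMIT 1024  /* hypothetical switch-over point, in bytes */

/* Gather 'count' bytes from every rank into 'recvbuf' at rank 0. */
void tree_gather(const void *sendbuf, int count, void *recvbuf, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (count > SMALL_LIMIT) {
        /* Serial algorithm: the root receives from one rank after another. */
        if (rank == 0) {
            memcpy(recvbuf, sendbuf, count);
            for (int i = 1; i < size; i++)
                MPI_Recv((char *)recvbuf + (size_t)i * count, count, MPI_BYTE,
                         i, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Send(sendbuf, count, MPI_BYTE, 0, 0, comm);
        }
        return;
    }

    /* Binomial tree: tmp accumulates the contiguous blocks of ranks
     * [rank, rank + held) as the rounds proceed. */
    char *tmp = malloc((size_t)size * count);
    memcpy(tmp, sendbuf, count);
    int held = 1;  /* number of per-process blocks currently held */
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            /* Send everything accumulated so far to the parent and stop. */
            MPI_Send(tmp, held * count, MPI_BYTE, rank - mask, 0, comm);
            break;
        }
        /* Receive the child's blocks; the child also holds 'held' blocks,
         * since the communicator size is assumed to be a power of two. */
        MPI_Recv(tmp + (size_t)held * count, held * count, MPI_BYTE,
                 rank + mask, 0, comm, MPI_STATUS_IGNORE);
        held *= 2;
    }
    if (rank == 0)
        memcpy(recvbuf, tmp, (size_t)size * count);
    free(tmp);
}
```

In round k of the tree phase, every rank whose k-th bit is set forwards its accumulated blocks to rank − 2^k, so the root obtains all data in log2(P) communication rounds rather than the P − 1 receives of the serial algorithm; the scatter is the mirror image, with the root pushing halves of the data down the same tree.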
