Global Combine Algorithms for 2-D Meshes with Wormhole Routing

The problem of performing a global combine (summation) operation on a distributed memory computer using a two-dimensional mesh interconnect with wormhole routing is considered. We present algorithms that are asymptotically optimal for short vectors (O(log(p)) for p processing nodes) and for long vectors (O(n) for n data elements per node), as well as hybrid algorithms that are superior for intermediate n. The algorithms are analyzed using detailed performance models that include the effects of link conflicts and other characteristics of the underlying communication system. The models are validated using experimental data from the Intel Touchstone DELTA computer. We show that no one algorithm is optimal for all vector lengths; rather, each of the presented algorithms is superior under some circumstances.
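The claim that no single algorithm wins at every vector length can be illustrated with a toy cost model. The sketch below is not the paper's performance model: it ignores link conflicts and arithmetic cost, and it assumes a simple linear message cost of alpha + m*beta for m elements. The names alpha, beta, tree_combine_time, and ring_combine_time are hypothetical. It compares a recursive-doubling style combine (few startups, full vector per step) against a ring-style reduce-scatter followed by all-gather (many startups, n/p elements per step), used here only as a stand-in for a long-vector algorithm.

```python
import math


def tree_combine_time(p, n, alpha, beta):
    """Assumed cost of a recursive-doubling combine.

    ceil(log2 p) steps, each sending the full n-element vector.
    alpha = per-message startup cost, beta = per-element transfer cost.
    """
    steps = math.ceil(math.log2(p))
    return steps * (alpha + n * beta)


def ring_combine_time(p, n, alpha, beta):
    """Assumed cost of a ring reduce-scatter followed by a ring all-gather.

    2 * (p - 1) steps, each sending roughly n / p elements.
    """
    steps = 2 * (p - 1)
    return steps * (alpha + (n / p) * beta)


def faster_algorithm(p, n, alpha, beta):
    """Return which of the two modeled algorithms is cheaper for this n."""
    t_tree = tree_combine_time(p, n, alpha, beta)
    t_ring = ring_combine_time(p, n, alpha, beta)
    return "tree" if t_tree <= t_ring else "ring"


if __name__ == "__main__":
    # Illustrative parameters only; not measured on any machine.
    p, alpha, beta = 64, 100.0, 1.0
    for n in (1, 100, 10_000, 100_000):
        print(n, faster_algorithm(p, n, alpha, beta))
```

Under these assumed parameters the startup term dominates for small n, so the logarithmic-step algorithm is cheaper, while for large n the per-element term dominates and the algorithm that moves less data per node is cheaper. A hybrid would sit between the two regimes.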