Scalable collective message-passing algorithms

Governments, universities, and companies expend vast resources building the top supercomputers. Processors and interconnect networks keep getting faster while the number of nodes grows exponentially, and problems of scale emerge, not least of which is the performance of collective operations. This thesis identifies two major scalability problems and proposes solutions for both. Our first contribution is a novel process-partitioning and remapping algorithm for exascale systems with far better time and space scaling than known algorithms. Our evaluations predict an improvement of up to 60x on large exascale systems and an arbitrarily large reduction in the temporary buffer space required to generate new communicators. Our second contribution consists of several novel collective algorithms for Clos and torus networks. Known allgather, reduce-scatter, and composite algorithms for Clos networks suffer their worst congestion precisely when the largest messages are exchanged, which damages performance, and known algorithms for torus networks use only one network port regardless of how many are available. Unlike known algorithms, ours introduce a small amount of redundant communication; that redundancy gives the flexibility to reorder the communication so that congestion hinders small messages rather than large ones, and to fully use every port on multi-port torus networks. On a 32k-node system, we deliver improvements of up to 11x for the reduce-scatter operation, when the native reduce-scatter algorithm does not use special hardware, and 5.5x for the allgather operation. We also show large improvements over native algorithms with as few as 16 processors.
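
For concreteness, below is a minimal C sketch of the standard MPI calls the two contributions sit behind: MPI_Comm_split for communicator partitioning, and MPI_Reduce_scatter and MPI_Allgather for the collectives whose Clos- and torus-network implementations are improved. It is not the thesis's algorithms; the split color, block size, and test data are illustrative assumptions.

/* Minimal sketch: standard MPI interface points only, not the thesis's algorithms. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* First contribution's interface point: partition the processes into
     * subcommunicators. The color (rank % 4) is an arbitrary example. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 4, rank, &sub);

    /* Second contribution's interface points: reduce-scatter and allgather. */
    const int block = 4;  /* elements each rank keeps after the reduce-scatter */
    int *send   = malloc((size_t)nprocs * block * sizeof *send);
    int *mine   = malloc((size_t)block * sizeof *mine);
    int *all    = malloc((size_t)nprocs * block * sizeof *all);
    int *counts = malloc((size_t)nprocs * sizeof *counts);

    for (int i = 0; i < nprocs * block; ++i) send[i] = rank + i;  /* test data */
    for (int i = 0; i < nprocs; ++i) counts[i] = block;

    /* Element-wise sum across all ranks; rank r keeps block r of the result. */
    MPI_Reduce_scatter(send, mine, counts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Every rank then collects every other rank's reduced block. */
    MPI_Allgather(mine, block, MPI_INT, all, block, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 holds %d gathered elements; first is %d\n",
               nprocs * block, all[0]);

    free(send); free(mine); free(all); free(counts);
    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}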
