Minimizing Communication in All-Pairs Shortest Paths

We consider distributed-memory algorithms for the all-pairs shortest paths (APSP) problem. Scaling APSP to high concurrencies requires both minimizing inter-processor communication and maximizing temporal data locality. The 2.5D APSP algorithm, which is based on the divide-and-conquer paradigm, satisfies both requirements: it can use any extra available memory to perform asymptotically less communication, and it is rich in semiring matrix multiplications, which have high temporal locality. We start by introducing a block-cyclic 2D (minimal-memory) APSP algorithm. With a careful choice of block size, this algorithm achieves the known communication lower bounds for latency and bandwidth. We then extend the 2D block-cyclic algorithm to a 2.5D algorithm, which can use $c$ extra copies of the data to reduce the bandwidth cost by a factor of $c^{1/2}$ compared to its 2D counterpart. However, the 2.5D algorithm also increases the latency cost by a factor of $c^{1/2}$. We provide a tighter lower bound on latency, which shows that this latency overhead is necessary to reduce the bandwidth cost along the critical path of execution. Our implementation achieves strong performance and scales to 24,576 cores of a Cray XE6 supercomputer by using well-tuned intra-node kernels within the distributed-memory algorithm.
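To make the phrase "rich in semiring matrix multiplications" concrete, the following is a minimal sequential sketch of a divide-and-conquer APSP recursion over the min-plus semiring. It is not the distributed 2D or 2.5D algorithm described above and is not taken from the authors' implementation; the function names (`min_plus`, `elementwise_min`, `dc_apsp`) and the dense list-of-lists representation are assumptions made for illustration. It assumes a zero diagonal and no negative cycles.

```python
import math

INF = math.inf  # marks "no edge" in the dense distance matrix


def min_plus(a, b):
    """Min-plus semiring product: c[i][j] = min over k of a[i][k] + b[k][j]."""
    inner, cols = len(b), len(b[0])
    return [[min(row[k] + b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def elementwise_min(a, b):
    """Entrywise minimum of two equally sized matrices."""
    return [[min(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dc_apsp(a):
    """Recursive APSP sketch (assumes zero diagonal, no negative cycles)."""
    n = len(a)
    if n == 1:
        return [[min(a[0][0], 0)]]
    h = n // 2
    # Split into four blocks: A = [[A11, A12], [A21, A22]].
    a11 = [row[:h] for row in a[:h]]
    a12 = [row[h:] for row in a[:h]]
    a21 = [row[:h] for row in a[h:]]
    a22 = [row[h:] for row in a[h:]]

    a11 = dc_apsp(a11)                              # paths within the first half
    a12 = min_plus(a11, a12)                        # first half -> second half
    a21 = min_plus(a21, a11)                        # second half -> first half
    a22 = elementwise_min(a22, min_plus(a21, a12))  # detours through first half
    a22 = dc_apsp(a22)                              # close the second half
    a21 = min_plus(a22, a21)
    a12 = min_plus(a12, a22)
    a11 = elementwise_min(a11, min_plus(a12, a21))  # detours through second half

    top = [r1 + r2 for r1, r2 in zip(a11, a12)]
    bottom = [r1 + r2 for r1, r2 in zip(a21, a22)]
    return top + bottom


if __name__ == "__main__":
    weights = [
        [0, 3, INF, 7],
        [8, 0, 2, INF],
        [5, INF, 0, 1],
        [2, INF, INF, 0],
    ]
    for row in dc_apsp(weights):
        print(row)
```

Apart from the two recursive calls, every step is a min-plus matrix product or an entrywise minimum. In a distributed setting, those block products are the operations to which communication-efficient matrix multiplication schemes could be applied.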
