Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations

This paper derives tradeoffs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These tradeoffs are lower bounds on the execution time of the algorithm which are independent of the number of processors, but dependent on the problem size. Therefore, they provide lower bounds on the parallel execution time of any algorithm computed by a system composed of any number of homogeneous components, each with associated computational, communication, and synchronization payloads. We employ a theoretical model counts the amount of work and data movement as a maximum of any execution path during the parallel computation. By considering this metric, rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with non-trivial dependency structures. We also present reductions from BSP and LogP algorithms to our execution model, extending our lower bounds to these two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Gaussian elimination, and Krylov subspace methods. Our lower bound for LU factorization demonstrates the optimality of Tiskin's LU algorithm answering an open question posed in his paper, as well as of the 2.5D LU algorithm which has analogous costs. We treat the computations in a general manner by noting that the computations share a similar dependency hypergraph structure and analyzing the communication requirements of lattice hypergraph structures.

[1]  F. P. Preparata,et al.  Processor—Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds , 1995, Theory of Computing Systems.

[2]  Alexander Tiskin Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..

[3]  H. T. Kung,et al.  I/O complexity: The red-blue pebble game , 1981, STOC '81.

[4]  Alok Aggarwal,et al.  Communication Complexity of PRAMs , 1990, Theor. Comput. Sci..

[5]  Jack Dongarra,et al.  ScaLAPACK user's guide , 1997 .

[6]  V. Strassen Gaussian elimination is not optimal , 1969 .

[7]  Michael T. Heath,et al.  Parallel solution of triangular systems on distributed-memory multiprocessors , 1988 .

[8]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[9]  Dror Irony,et al.  TRADING REPLICATION FOR COMMUNICATION IN PARALLEL DISTRIBUTED-MEMORY DENSE SOLVERS , 2002 .

[10]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[11]  Alexander Tiskin,et al.  All-Pairs Shortest Paths Computation in the BSP Model , 2001, ICALP.

[12]  Evripidis Bampis,et al.  Optimal Schedules for d-D Grid Graphs with Communication Delays , 1998, Parallel Comput..

[13]  James Demmel,et al.  Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.

[14]  Dror Irony,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..

[15]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[16]  Michael A. Bender,et al.  Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.

[17]  James Demmel,et al.  Improving communication performance in dense linear algebra via topology aware collectives , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[18]  Alexander Tiskin,et al.  Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.

[19]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[20]  Danny C. Sorensen,et al.  Analysis of Pairwise Pivoting in Gaussian Elimination , 1985, IEEE Transactions on Computers.

[21]  Christos H. Papadimitriou,et al.  A Communication-Time Tradeoff , 1987, SIAM J. Comput..

[22]  Optimal Schedules for d-D Grid Graphs with Communication Delays (Extended Abstract) , 1996, STACS.

[23]  H. Whitney,et al.  An inequality related to the isoperimetric inequality , 1949 .

[24]  Robert A. van de Geijn,et al.  Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.

[25]  Stephen Warshall,et al.  A Theorem on Boolean Matrices , 1962, JACM.