Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations

This article derives trade-offs among three basic costs of a parallel algorithm: synchronization, data movement, and computational work. These trade-offs are lower bounds on execution time that are independent of the number of processors but depend on the problem size; consequently, they bound the execution time of any parallel schedule of an algorithm on a system composed of any number of homogeneous processors, each with associated computation, communication, and synchronization costs. We employ a theoretical model that measures the work and the data movement of a schedule as the maximum incurred along any execution path of the parallel computation. By considering this metric rather than the total communication volume over the whole machine, we obtain new insights into the characteristics of parallel schedules for algorithms with nontrivial dependency structures. We also present reductions from BSP and LogGP algorithms to our execution model, which extend our lower bounds to those two models of parallel computation. We first develop our results for general dependency graphs and hypergraphs based on their expansion properties, and we then apply the resulting theorems to specific algorithms in numerical linear algebra, namely triangular substitution, Cholesky factorization, and stencil computations. We represent some of these algorithms as families of dependency graphs and derive their communication lower bounds by studying the communication requirements of the hypergraph structures shared by these dependency graphs. In addition to the lower bounds, we introduce a new communication-efficient parallelization of stencil computations, motivated by our lower-bound analysis and by the properties of previously existing parallelizations.
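For intuition, the headline trade-offs take a product form relating the computation cost F incurred along some execution path, the communication cost W, and the number of synchronizations S. The statements below are an illustrative sketch of the published bounds for n-by-n problems, written for this summary; the precise hypotheses and regimes are given in the body of the paper.

```latex
% Illustrative form of the trade-off lower bounds for n-by-n problems.
% F = computation along an execution path, W = words moved, S = synchronizations.
F_{\mathrm{TRSV}} \cdot S_{\mathrm{TRSV}} = \Omega(n^2)
\qquad \text{(triangular substitution)}
F_{\mathrm{Chol}} \cdot S_{\mathrm{Chol}}^{2} = \Omega(n^3),
\qquad W_{\mathrm{Chol}} \cdot S_{\mathrm{Chol}} = \Omega(n^2)
\qquad \text{(Cholesky factorization)}
```

To make the dependency structure behind the triangular-substitution bound concrete, here is a minimal sketch of forward substitution in Python with NumPy (an illustration written for this summary, not code from the paper). Each solution entry x[i] depends on all earlier entries, so the computation contains a length-n chain that any parallel schedule must traverse in order; the comments note, informally, how this chain yields the F * S = Omega(n^2) trade-off.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L.

    Each x[i] depends on every x[j] with j < i, forming a length-n
    dependency chain. Informally: a schedule that synchronizes S times
    advances this chain by about n/S entries per phase, and computing a
    run of n/S entries requires roughly (n/S)^2 update work on a single
    processor, so F >= S * (n/S)^2 = n^2 / S, i.e. F * S = Omega(n^2).
    """
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x[i] needs all previously computed entries x[0..i-1].
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

# Usage: solve a small random triangular system and check the residual.
rng = np.random.default_rng(0)
n = 8
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # well conditioned
b = rng.standard_normal(n)
x = forward_substitution(L, b)
assert np.allclose(L @ x, b)
```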
