Communication Lower Bounds for Distributed-Memory Computations

In this paper we propose a new approach to the study of the communication requirements of distributed computations, one that removes the restrictive assumptions under which earlier results were derived. We illustrate the approach by giving tight lower bounds on the communication complexity of solving several computational problems on a distributed-memory parallel machine: standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform. Our bounds rely only on a mild assumption on work distribution, and significantly strengthen previous results, which require the computation to be balanced among the processors, specific initial distributions of the input data, or an upper bound on the size of the processors' local memories.
