Contention Bounds for Combinations of Computation Graphs and Network Topologies

Abstract: Network topologies can have a significant effect on the cost of algorithms due to inter-processor communication. Parallel algorithms that ignore network topology can suffer from contention along network links. However, for certain combinations of computations and network topologies, costly network contention is unavoidable, even for optimally designed algorithms. We obtain a novel contention lower bound that is a function of the network and computation graph parameters. To this end, we compare the communication bandwidth needs of subsets of processors with the available network capacity (as opposed to the per-processor analysis used in most previous studies). Applying this analysis, we improve communication cost lower bounds for several combinations of fundamental computations on common network topologies.
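The counting argument sketched in the abstract can be illustrated on a toy instance. The sketch below is not the paper's actual bounds; it only demonstrates the shape of the argument: for every subset of processors, the words that subset must exchange with the rest of the machine must cross the links leaving the subset, so the largest ratio of required traffic to cut capacity lower-bounds the communication cost. The functions `comm_out` and `cut_capacity`, and the ring/gather scenario, are hypothetical placeholders chosen for simplicity.

```python
def contention_lower_bound(comm_out, cut_capacity, P):
    """Max over subset sizes s of (words the subset must send or receive)
    divided by (links connecting the subset to the rest of the network).
    Every such ratio lower-bounds the number of communication rounds
    along the most congested cut."""
    return max(comm_out(s) / cut_capacity(s) for s in range(1, P))

# Toy setting: P processors on a ring, where any contiguous subset of
# s < P processors is joined to the rest by exactly 2 links.  Suppose
# each processor holds M words destined outside the subset (e.g., an
# all-to-one gather to a root outside it), so comm_out(s) = s * M.
P, M = 64, 1000
bound = contention_lower_bound(lambda s: s * M, lambda s: 2, P)
print(bound)  # maximized at s = P - 1: (P - 1) * M / 2 = 31500.0
```

The bound grows with the subset size because the ring's cut capacity is constant, which is precisely the kind of topology-induced bottleneck the abstract describes: a per-processor analysis alone would not expose it.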
