Network Topologies and Inevitable Contention

Network topologies can have a significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of parameters of both the network and the computation graph. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds, which specify only the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks; on the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms in only a few cases, leaving many open problems for network and algorithmic design.
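
To convey the flavor of such bounds, the sketch below shows how a per-link (contention) bound can arise from a simple cut-counting argument. The symbols here (a processor set S, the data volume D(S) that any schedule must move across the cut around S, and the number of network links c(S) crossing that cut) are illustrative notation assumed for this sketch, not the paper's exact formulation.

```latex
% Minimal sketch (illustrative notation, not the paper's precise theorem):
% if every valid schedule must move D(S) words across the cut separating a
% processor set S from the rest of the machine, and the network supplies only
% c(S) links across that cut, then some link must carry at least the average
% load; maximizing over choices of S gives a per-link (contention) lower bound.
\[
  W_{\mathrm{link}} \;\ge\; \max_{S \subseteq V_{\mathrm{net}}} \frac{D(S)}{c(S)}
\]
% For context, the well-known memory-independent per-processor bandwidth bounds
% on P processors for n-by-n matrix multiplication are
\[
  W_{\mathrm{classical}} = \Omega\!\left(\frac{n^2}{P^{2/3}}\right),
  \qquad
  W_{\mathrm{Strassen}} = \Omega\!\left(\frac{n^2}{P^{2/\omega_0}}\right),
  \quad \omega_0 = \log_2 7,
\]
% while on a 3D torus the cut around a sub-torus of s nodes crosses only
% \Theta(s^{2/3}) links. A mismatch between how fast D(S) and c(S) grow with
% |S| is the kind of effect that can let the contention bound overtake the
% per-processor bound for the faster algorithm.
```

Note that the maximization over S means the binding cut need not be the bisection: a smaller processor set with proportionally fewer outgoing links can yield a stronger bound.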
