Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1

Abstract: Communication, i.e., moving data, between levels of a memory hierarchy or between parallel processors on a network, can greatly dominate the cost of computation, so algorithms that minimize communication can run much faster (and use less energy) than algorithms that do not. Motivated by this, attainable communication lower bounds were established in [12, 13, 4] for a variety of algorithms including matrix computations. The lower bound approach used initially in [13] for Θ(N^3) matrix multiplication, and later in [4] for many other linear algebra algorithms, depended on a geometric result by Loomis and Whitney [16]: this result bounded the volume of a 3D set (representing multiply-adds done in the inner loop of the algorithm) using the product of the areas of certain 2D projections of this set (representing the matrix entries available locally, i.e., without communication). Using a recent generalization of Loomis and Whitney's result, we generalize this lower bound approach to a much larger class of algorithms, which may have arbitrary numbers of loops and arrays with arbitrary dimensions, as long as the index expressions are affine combinations of loop variables. In other words, the algorithm can do arbitrary operations on any number of variables like A(i_1, i_2, i_2 - 2i_1, 3 - 4i_3 + 7i_4, ...). Moreover, the result applies to recursive programs, irregular iteration spaces, sparse matrices, and other data structures as long as the computation can be logically mapped to loops and indexed data structure accesses. We also discuss when optimal algorithms exist that attain the lower bounds; this leads to new asymptotically faster algorithms for several problems.
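To make the class of programs concrete, the sketch below shows classical Θ(N^3) matrix multiplication, the motivating example from the abstract: the three nested loops enumerate the 3D set of multiply-adds, and the entries of A, B, and C they touch are exactly the three 2D projections that the Loomis-Whitney inequality bounds. The two-level memory model with fast-memory capacity M, the tile size b ≈ sqrt(M/3), and the function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative sketch (not from the paper): classical N^3 matrix
# multiplication, blocked so that three b-by-b tiles fit in a fast memory
# of capacity M words. The triple loop is the 3D set of multiply-adds;
# the tiles of A, B, and C it touches are its three 2D projections,
# which is what the Loomis-Whitney inequality bounds.
def tiled_matmul(A, B, M=3 * 64 * 64):
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))  # tile size so ~3 tiles fit in fast memory
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # Affine index expressions: C(i, j) += A(i, k) * B(k, j)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# With this blocking, the traffic between slow and fast memory is
# Theta(n^3 / sqrt(M)) words, which matches the communication lower bound
# obtained from the Loomis-Whitney argument for classical matmul.
```

In this model the blocking attains the lower bound up to constant factors; the generalization described in the abstract plays the same game for loop nests of any depth with arbitrary affine array subscripts.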

[1] James Demmel, et al. Improving communication performance in dense linear algebra via topology aware collectives, 2011, International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2] Telecommunications Board. The Future of Computing Performance: Game Over or Next Level?, 2011.

[3] H. Whitney, et al. An inequality related to the isoperimetric inequality, 1949.

[4] Marc Snir, et al. Getting Up to Speed: The Future of Supercomputing, 2004.

[5] Ioannis K. Argyros, et al. Of Pure and Applied Mathematics, 2003.

[6] Erik Massop. Hilbert's tenth problem, 2012.

[7] Ravi Vakil. Murphy's law in algebraic geometry: Badly-behaved deformation spaces, 2004.

[8] Alexander Tiskin. Communication-efficient parallel generic pairwise elimination, 2007, Future Gener. Comput. Syst.

[9] James Demmel, et al. Brief announcement: communication bounds for heterogeneous architectures, 2011, SPAA '11.

[10] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[11] Samuel H. Fuller, et al. The Future of Computing Performance: Game Over or Next Level?, 2014.

[12] A. Tarski. A Decision Method for Elementary Algebra and Geometry, 2023.

[13] C. Bennett, et al. Interpolation of Operators, 1987.

[14] George E. Collins. Quantifier elimination for real closed fields by cylindrical algebraic decomposition, 1975.

[15] James Demmel, et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds, 2012, SPAA '12.

[16] James Demmel, et al. Minimizing Communication in Numerical Linear Algebra, 2009, SIAM J. Matrix Anal. Appl.

[17] J. K. Hunter, et al. Measure Theory, 2007.

[18] T. Tao, et al. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities, 2005, math/0505691.

[19] James Demmel, et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms, 2011, Euro-Par.

[20] Alexander Tiskin, et al. Memory-Efficient Matrix Multiplication in the BSP Model, 1999, Algorithmica.

[21] Alexander Tiskin, et al. The Bulk-Synchronous Parallel Random Access Machine, 1996, Theor. Comput. Sci.

[22] Katherine A. Yelick, et al. A Communication-Optimal N-Body Algorithm for Direct Interactions, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[23] Alok Aggarwal, et al. Communication Complexity of PRAMs, 1990, Theor. Comput. Sci.

[24] J. Koenigsmann. Defining Z in Q, 2010.

[25] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[26] Stefán Ingi Valdimarsson. The Brascamp–Lieb Polyhedron, 2010, Canadian Journal of Mathematics.

[27] R. Tennant. Algebra, 1941, Nature.