Communication Bounds for Heterogeneous Architectures

As the gap between the cost of communication (i.e., data movement) and the cost of computation continues to grow, pursuing algorithms that minimize communication has become a critical research objective. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work has established lower bounds for a wide set of linear algebra algorithms (including both the classical O(n^3) and Strassen's O(n^(log2 7)) algorithms for matrix-matrix multiplication) on a sequential machine and on a distributed-memory parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present algorithms which prove that the lower bounds are tight (i.e., attainable) for dense matrix-vector multiplication and for both the classical and Strassen matrix-matrix multiplication algorithms.
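To make the flavor of these bounds concrete, the sketch below (written in LaTeX) shows the per-processor forms that such results generalize. The notation here is introduced only for illustration and is not necessarily the paper's own: F_i denotes the flops assigned to processor i, M_i its local memory size in words, W_i the words it communicates, S_i the messages it sends, and gamma_i, beta_i, alpha_i its per-flop, per-word, and per-message costs. The precise statements and constants are given in the paper itself.

% Classical bound (in the style of Hong--Kung and Irony--Toledo--Tiskin) and the
% corresponding Strassen-type bound with exponent omega_0 = log_2 7.
\[
  W_i^{\text{classical}} = \Omega\!\left(\frac{F_i}{\sqrt{M_i}}\right),
  \qquad
  W_i^{\text{Strassen}} = \Omega\!\left(\frac{F_i}{M_i^{\,\omega_0/2 - 1}}\right),
  \qquad \omega_0 = \log_2 7.
\]
% In a heterogeneous cost model of the kind assumed in this sketch, the work
% assignment {F_i} is chosen to minimize the time of the slowest processor:
\[
  T = \max_i \left( \gamma_i F_i + \beta_i W_i + \alpha_i S_i \right).
\]

In this sketch, attaining the bounds amounts to balancing the per-processor times while keeping each W_i within a constant factor of its lower bound.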
