Communication lower bounds and optimal algorithms for numerical linear algebra

The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n^3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.
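As a rough illustration of the gap between a widely used formulation and the lower bound, the sketch below counts words moved between slow and fast memory for classical matrix multiplication under a simple two-level model. It is not taken from the paper: the function names, the cost model, and the choice of block size b = floor(sqrt(M/3)) are assumptions made for illustration. The only fact relied on is the known lower bound of order n^3 / sqrt(M) words for classical O(n^3) matrix multiplication with a fast memory of M words.

```python
import math


def naive_words_moved(n):
    """Words moved by a row-times-column product, ASSUMING no operand is
    reused in fast memory between inner products (illustrative model)."""
    # Each of the n^2 entries of C reads a row of A and a column of B
    # (2n words) and writes one word of C.
    return 2 * n**3 + n**2


def blocked_words_moved(n, M):
    """Words moved by a blocked product whose three b-by-b blocks fit in a
    fast memory of M words (hypothetical cost model, not the paper's)."""
    b = min(n, math.isqrt(M // 3))  # largest b with 3*b*b <= M
    if b < 1:
        raise ValueError("fast memory must hold at least three words")
    # Block edge lengths along one dimension; the last block may be smaller.
    sizes = [min(b, n - start) for start in range(0, n, b)]
    words = 0
    for bi in sizes:
        for bj in sizes:
            # The C block is accumulated in fast memory and written once.
            words += bi * bj
            for bk in sizes:
                # Read one block of A and one block of B per block product.
                words += bi * bk + bk * bj
    return words


def lower_bound_scale(n, M):
    """The n^3 / sqrt(M) scale of the lower bound, constants omitted."""
    return n**3 / math.sqrt(M)


if __name__ == "__main__":
    n, M = 1024, 3 * 64 * 64
    print("naive  :", naive_words_moved(n))
    print("blocked:", blocked_words_moved(n, M))
    print("n^3/sqrt(M):", round(lower_bound_scale(n, M)))
```

In this model the blocked count stays within a small constant factor of n^3 / sqrt(M), while the naive count grows like n^3 regardless of M, which is the sense in which a common formulation can communicate asymptotically more than necessary.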

[106]  Ümit V. Çatalyürek,et al.  A fine-grain hypergraph model for 2D decomposition of sparse matrices , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[107]  Karen S. Braman,et al.  The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance , 2001, SIAM J. Matrix Anal. Appl..

[108]  Rudnei Dias da Cunha,et al.  New Parallel (Rank-Revealing) QR Factorization Algorithms , 2002, Euro-Par.

[109]  Karen S. Braman,et al.  The Multishift QR Algorithm. Part II: Aggressive Early Deflation , 2001, SIAM J. Matrix Anal. Appl..

[110]  Kesheng Wu,et al.  A Block Orthogonalization Procedure with Constant Synchronization Requirements , 2000, SIAM J. Sci. Comput..

[111]  A. Tiskin Bulk-Synchronous Parallel Gaussian Elimination , 2002 .

[112]  Richard Vuduc,et al.  Automatic performance tuning of sparse matrix kernels , 2003 .

[113]  S. SIAMJ. A ROBUST CRITERION FOR THE MODIFIED GRAM – SCHMIDT ALGORITHM WITH SELECTIVE REORTHOGONALIZATION , 2003 .

[114]  Jeremy D. Frens,et al.  QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.

[115]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[116]  Viktor K. Prasanna,et al.  Optimizing graph algorithms for improved cache performance , 2002, IEEE Transactions on Parallel and Distributed Systems.

[117]  Marc Snir,et al.  GETTING UP TO SPEED THE FUTURE OF SUPERCOMPUTING , 2004 .

[118]  Dror Irony,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..

[119]  T. Tao,et al.  Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities , 2005, math/0505691.

[120]  M. Rozložník,et al.  The loss of orthogonality in the Gram-Schmidt orthogonalization process , 2005 .

[121]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[122]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[123]  Volker Strumpen,et al.  Cache oblivious stencil computations , 2005, ICS '05.

[124]  Robert A. van de Geijn,et al.  Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.

[125]  Telecommunications Board,et al.  Getting Up to Speed: The Future of Supercomputing , 2005 .

[126]  Gerard L. G. Sleijpen,et al.  Reliable updated residuals in hybrid Bi-CG methods , 1996, Computing.

[127]  Rob H. Bisseling,et al.  Parallel hypergraph partitioning for scientific computing , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[128]  Julien Langou,et al.  A note on the error analysis of classical Gram–Schmidt , 2006, Numerische Mathematik.

[129]  G. Meurant The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations , 2006 .

[130]  B. Kågström,et al.  The Multishift QZ Algorithm with Aggressive Early Deflation ? , 2006 .

[131]  G. Meurant,et al.  The Lanczos and conjugate gradient algorithms in finite precision arithmetic , 2006, Acta Numerica.

[132]  Volker Strumpen,et al.  The Cache Complexity of Multithreaded Cache Oblivious Algorithms , 2009, SPAA '06.

[133]  Alexander Tiskin Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..

[134]  J. Demmel,et al.  Avoiding Communication in Computing Krylov Subspaces , 2007 .

[135]  Michael A. Bender,et al.  Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.

[136]  James Demmel,et al.  Fast matrix multiplication is stable , 2006, Numerische Mathematik.

[137]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[138]  Keshav Pingali,et al.  An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.

[139]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..

[140]  James Demmel,et al.  When cache blocking of sparse matrix vector multiply works and why , 2007, Applicable Algebra in Engineering, Communication and Computing.

[141]  James Demmel,et al.  Fast linear algebra is stable , 2006, Numerische Mathematik.

[142]  James Demmel,et al.  Cache efficient bidiagonalization using BLAS 2.5 operators , 2008, TOMS.

[143]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[144]  James Demmel,et al.  Avoiding communication in sparse matrix computations , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[145]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[146]  Michael M. Wolf,et al.  Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning , 2008 .

[147]  James Demmel,et al.  Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers , 2008, SIAM J. Sci. Comput..

[148]  Bruce Hendrickson,et al.  Optimizing parallel sparse matrix-vector multiplication by partitioning. , 2008 .

[149]  G. W. Stewart Block Gram--Schmidt Orthogonalization , 2008, SIAM J. Sci. Comput..

[150]  Weiqiang Wang,et al.  A Multilevel Parallelization Framework for High-Order Stencil Computations , 2009, Euro-Par.

[151]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[152]  M. Tartibi,et al.  2 Shared Memory Implementation , 2009 .

[153]  Tamara G. Kolda,et al.  Tensor Decompositions and Applications , 2009, SIAM Rev..

[154]  James Demmel,et al.  Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[155]  Alan LaMielle,et al.  Computer Science Technical Report Enabling Code Generation within the Sparse Polyhedral Framework Enabling Code Generation within the Sparse Polyhedral Framework , 2010 .

[156]  Riko Jacob,et al.  The I/O Complexity of Sparse Matrix Dense Matrix Multiplication , 2010, LATIN.

[157]  Riko Jacob,et al.  Evaluating Non-square Sparse Bilinear Forms on Multiple Vector Pairs in the I/O-Model , 2010, MFCS.

[158]  James Demmel,et al.  Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.

[159]  Jianlin Xia,et al.  Fast algorithms for hierarchically semiseparable matrices , 2010, Numer. Linear Algebra Appl..

[160]  Daniel Kressner,et al.  On Aggressive Early Deflation in Parallel Variants of the QR Algorithm , 2010, PARA.

[161]  Mark Hoemmen,et al.  Communication-avoiding Krylov subspace methods , 2010 .

[162]  James Demmel,et al.  Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..

[163]  James Demmel,et al.  Brief announcement: communication bounds for heterogeneous architectures , 2011, SPAA '11.

[164]  Lothar Reichel,et al.  On the generation of Krylov subspace bases , 2012 .

[165]  James Demmel,et al.  CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..

[166]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[167]  Rob H. Bisseling,et al.  Two-dimensional cache-oblivious sparse matrix-vector multiplication , 2011, Parallel Comput..

[168]  James Demmel,et al.  Communication-Avoiding QR Decomposition for GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[169]  Lars Karlsson,et al.  Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures , 2011, Parallel Comput..

[170]  Gil Shklarski,et al.  Partitioned Triangular Tridiagonalization , 2011, TOMS.

[171]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[172]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[173]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[174]  Sraban Kumar Mohanty,et al.  I/O efficient QR and QZ algorithms , 2012, 2012 19th International Conference on High Performance Computing.

[175]  Katherine Yelick,et al.  Autotuning Sparse Matrix-Vector Multiplication for Multicore , 2012 .

[176]  James Demmel,et al.  Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..

[177]  Virginia Vassilevska Williams,et al.  Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[178]  James Demmel,et al.  Communication avoiding successive band reduction , 2012, PPoPP '12.

[179]  James Demmel,et al.  Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication , 2012, MedAlg.

[180]  James Demmel,et al.  Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.

[181]  Marghoob Mohiyuddin,et al.  Tuning Hardware and Software for Multiprocessors , 2012 .

[182]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[183]  James Demmel,et al.  Communication-optimal parallel algorithm for strassen's matrix multiplication , 2012, SPAA '12.

[184]  J. Demmel,et al.  Sequential Communication Bounds for Fast Linear Algebra , 2012 .

[185]  Katherine A. Yelick,et al.  Communication avoiding and overlapping for numerical linear algebra , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[186]  James Demmel,et al.  Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[187]  James Demmel,et al.  Minimizing Communication in All-Pairs Shortest Paths , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[188]  James Demmel,et al.  Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[189]  Grey Ballard,et al.  Avoiding Communication in Dense Linear Algebra , 2013 .

[190]  Graph expansion and communication costs of fast matrix multiplication , 2012, JACM.

[191]  Katherine A. Yelick,et al.  A Communication-Optimal N-Body Algorithm for Direct Interactions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[192]  James Demmel,et al.  Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[193]  Benjamin Lipshitz,et al.  Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication , 2013 .

[194]  Nicholas J. Higham,et al.  Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD , 2013, SIAM J. Sci. Comput..

[195]  James Demmel,et al.  Perfect Strong Scaling Using No Additional Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[196]  James Demmel,et al.  Communication efficient gaussian elimination with partial pivoting using a shape morphing data layout , 2013, SPAA.

[197]  James Demmel,et al.  Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[198]  James Demmel,et al.  Exploiting Data Sparsity in Parallel Matrix Powers Computations , 2013, PPAM.

[199]  J. Demmel An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra , 2013 .

[200]  James Demmel,et al.  Communication optimal parallel multiplication of sparse random matrices , 2013, SPAA.

[201]  James Demmel,et al.  Avoiding Communication in Nonsymmetric Lanczos-Based Krylov Subspace Methods , 2013, SIAM J. Sci. Comput..

[202]  Riko Jacob,et al.  Tight Bounds for Low Dimensional Star Stencils in the External Memory Model , 2012, WADS.

[203]  James Demmel,et al.  Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.

[204]  Piotr Luszczek,et al.  An improved parallel singular value algorithm and its implementation for multicore hardware , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[205]  Samuel H. Fuller,et al.  The Future of Computing Performance: Game Over or Next Level? , 2014 .

[206]  J. Demmel,et al.  Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations , 2014, SPAA.

[207]  Michele Scquizzato,et al.  Communication Lower Bounds for Distributed-Memory Computations , 2013, STACS.

[208]  J. Demmel,et al.  Tradeoffs between synchronization , communication , and work in parallel linear algebra computations , 2014 .

[209]  James Demmel,et al.  A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of s-Step Krylov Subspace Methods , 2014, SIAM J. Matrix Anal. Appl..

[210]  Samuel Williams,et al.  s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[211]  James Demmel,et al.  Reconstructing Householder Vectors from Tall-Skinny QR , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[212]  James Demmel,et al.  Communication Avoiding Rank Revealing QR Factorization with Column Pivoting , 2015, SIAM J. Matrix Anal. Appl..

[213]  James Demmel,et al.  Avoiding Communication in Successive Band Reduction , 2015, ACM Trans. Parallel Comput..

[214]  W. Jalbyf,et al.  STABILITY ANALYSIS AND IMPROVEMENT OF THE BLOCK GRAM-SCHMIDT ALGORITHM , .