Communication lower bounds and optimal algorithms for numerical linear algebra
James Demmel | Oded Schwartz | Grey Ballard | Nicholas Knight | Mark Hoemmen | Erin Carson
[1] G. Miller. On the Solution of a System of Linear Equations , 1910 .
[2] H. Whitney,et al. An inequality related to the isoperimetric inequality , 1949 .
[3] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators , 1950 .
[4] W. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem , 1951 .
[5] M. Hestenes,et al. Methods of conjugate gradients for solving linear systems , 1952 .
[6] Stephen Warshall,et al. A Theorem on Boolean Matrices , 1962, JACM.
[7] Stephen J. Garland,et al. Algorithm 97: Shortest path , 1962, Commun. ACM.
[8] Å. Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization , 1967 .
[9] V. Strassen. Gaussian elimination is not optimal , 1969 .
[10] Lynn Elliot Cannon,et al. A cellular computer to implement the Kalman filter algorithm , 1969 .
[11] N. Abdelmalek. Round off error analysis for Gram-Schmidt method and solution of linear least squares problems , 1971 .
[12] J. O. Aasen. On the reduction of a symmetric matrix to tridiagonal form , 1971 .
[13] D. Rose,et al. Complexity Bounds for Regular Finite Difference and Finite Element Grids , 1973 .
[14] A. George. Nested Dissection of a Regular Finite Element Mesh , 1973 .
[15] A. Kiełbasiński. Analiza numeryczna algorytmu ortogonalizacji Grama-Schmidta , 1974 .
[16] J. Bunch,et al. Some stable methods for calculating inertia and solving symmetric linear systems , 1977 .
[17] Jack J. Dongarra,et al. Matrix Eigensystem Routines - EISPACK Guide, Second Edition , 1976, Lecture Notes in Computer Science.
[18] B. S. Garbow,et al. Matrix Eigensystem Routines — EISPACK Guide , 1974, Lecture Notes in Computer Science.
[19] Brian T. Smith,et al. Matrix Eigensystem Routines — EISPACK Guide , 1974, Lecture Notes in Computer Science.
[20] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[21] Sartaj Sahni,et al. Parallel Matrix and Graph Algorithms , 1981, SIAM J. Comput..
[22] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[23] Leslie G. Valiant,et al. Size Bounds for Superconcentrators , 1983, Theor. Comput. Sci..
[24] Gene H. Golub,et al. Matrix computations , 1983 .
[25] John Van Rosendale. Minimizing Inner Product Data Dependencies in Conjugate Gradient Iteration , 1983, ICPP.
[26] Dennis Gannon,et al. On the Impact of Communication Complexity on the Design of Parallel Numerical Algorithms , 1984, IEEE Transactions on Computers.
[27] Y. Saad,et al. Practical Use of Polynomial Preconditionings for the Conjugate Gradient Method , 1985 .
[28] Zhishun A. Liu,et al. A Look Ahead Lanczos Algorithm for Unsymmetric Matrices , 1985 .
[29] Christian H. Bischof,et al. The WY representation for products of Householder matrices , 1985, PPSC.
[30] Danny C. Sorensen,et al. Analysis of Pairwise Pivoting in Gaussian Elimination , 1985, IEEE Transactions on Computers.
[31] R. Tarjan,et al. The analysis of a nested dissection algorithm , 1987 .
[32] H. Walker,et al. Note on a Householder implementation of the GMRES method , 1986 .
[33] Y. Saad,et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .
[34] Ed Anderson,et al. LAPACK Users' Guide , 1995 .
[35] Jack Dongarra,et al. ScaLAPACK Users' Guide , 1987 .
[36] Jack Dongarra,et al. LINPACK Users' Guide , 1987 .
[37] François Irigoin,et al. Supernode partitioning , 1988, POPL '88.
[38] Jack J. Dongarra,et al. Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs , 1988, TOMS.
[39] G. Golub,et al. Parallel block schemes for large-scale least-squares computations , 1988 .
[40] H. Walker. Implementation of the GMRES method using Householder transformations , 1988 .
[41] Alok Aggarwal,et al. The input/output complexity of sorting and related problems , 1988, CACM.
[42] Jack J. Dongarra,et al. An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.
[43] Anthony T. Chronopoulos,et al. On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy , 1989, Parallel Comput..
[44] C. Loan,et al. A Storage-Efficient $WY$ Representation for Products of Householder Transformations , 1989 .
[45] Anthony T. Chronopoulos,et al. s-step iterative methods for symmetric linear systems , 1989 .
[46] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.
[47] Alok Aggarwal,et al. Communication Complexity of PRAMs , 1990, Theor. Comput. Sci..
[48] L. Reichel. Newton interpolation at Leja points , 1990 .
[49] L. Trefethen,et al. Average-case stability of Gaussian elimination , 1990 .
[50] Jack J. Dongarra,et al. Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs , 1990, TOMS.
[51] Graham F. Carey,et al. Parallelizable Restarted Iterative Methods for Nonsymmetric Linear Systems , 1991, PPSC.
[52] D. Hut. A Newton Basis GMRES Implementation , 1991 .
[53] Dan Hu,et al. An Implementation of the GMRES Method Using QR Factorization , 1991, SIAM Conference on Parallel Processing for Scientific Computing.
[54] W. Joubert,et al. Parallelizable restarted iterative methods for nonsymmetric linear systems. part I: Theory , 1992 .
[55] Danny C. Sorensen,et al. Implicit Application of Polynomial Filters in a k-Step Arnoldi Method , 1992, SIAM J. Matrix Anal. Appl..
[56] C. Puglisi. Modification of the Householder method based on the compact WY representation , 1992 .
[57] Anthony T. Chronopoulos,et al. An efficient nonsymmetric Lanczos method on parallel vector computers , 1992 .
[58] S. Lennart Johnsson,et al. Minimizing the Communication Time for Matrix Multiplication on Multiprocessors , 1993, Parallel Comput..
[59] Edith Cohen,et al. Estimating the size of the transitive closure in linear time , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.
[60] Ramesh C. Agarwal,et al. A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication , 1994, IBM J. Res. Dev..
[61] Richard Barrett,et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.
[62] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..
[63] Beresford N. Parlett,et al. The New qd Algorithms , 1995, Acta Numerica.
[64] Sivan Toledo,et al. Quantitative performance modeling of scientific computations and creating locality in numerical algorithms , 1995 .
[65] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.
[66] Anthony T. Chronopoulos,et al. Parallel Iterative S-Step Methods for Unsymmetric Linear Systems , 1996, Parallel Comput..
[67] Ming Gu,et al. Efficient Algorithms for Computing a Strong Rank-Revealing QR Factorization , 1996, SIAM J. Sci. Comput..
[68] Eric de Sturler,et al. A Performance Model for Krylov Subspace Methods on Mesh-Based Parallel Computers , 1996, Parallel Comput..
[69] Gene H. Golub,et al. Matrix Computations, Third Edition , 1996 .
[70] P. Sadayappan,et al. Communication-Efficient Matrix Multiplication on Hypercubes , 1996, Parallel Comput..
[71] Jack Dongarra,et al. A Test Matrix Collection for Non-Hermitian Eigenvalue Problems , 1997 .
[72] Anne Greenbaum,et al. Iterative methods for solving linear systems , 1997, Frontiers in applied mathematics.
[73] Sivan Toledo,et al. Efficient Out-of-Core Algorithms for Linear Relaxation Using Blocking Covers , 1997, J. Comput. Syst. Sci..
[74] Fred G. Gustavson,et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..
[75] James Demmel,et al. Applied Numerical Linear Algebra , 1997 .
[76] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..
[77] A. Greenbaum. Estimating the Attainable Accuracy of Recursively Computed Residual Methods , 1997, SIAM J. Matrix Anal. Appl..
[78] Edith Cohen,et al. Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..
[79] M. Rozložník,et al. Numerical behaviour of the modified Gram-Schmidt GMRES implementation , 1997 .
[80] Martin H. Gutknecht,et al. Lanczos-type solvers for nonsymmetric linear systems of equations , 1997, Acta Numerica.
[81] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting , 1997 .
[82] J. Demmel,et al. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems , 1997 .
[83] Erik Elmroth,et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems , 1998, PARA.
[84] Accuracy of two three-term and three two-term recurrences for Krylov space solvers , 1999 .
[85] Denis Vanderstraeten,et al. A Stable and Efficient Parallel Block Gram-Schmidt Algorithm , 1999, Euro-Par.
[86] Franco P. Preparata,et al. Processor—Time Tradeoffs under Bounded-Speed Message Propagation: Part II, Lower Bounds , 1999, Theory of Computing Systems.
[87] Alexander Tiskin,et al. Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.
[88] Ümit V. Çatalyürek,et al. Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication , 1999, IEEE Trans. Parallel Distributed Syst..
[89] Jack Dongarra,et al. Templates for the Solution of Algebraic Eigenvalue Problems , 2000, Software, environments, tools.
[90] Frédéric Guyomarc'h,et al. A Deflated Version of the Conjugate Gradient Algorithm , 1999, SIAM J. Sci. Comput..
[91] Qiang Ye,et al. Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems , 2000, Math. Comput..
[92] H. V. der. Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals , 2000 .
[93] David S. Wise. Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.
[94] Qiang Ye,et al. Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals , 2000, SIAM J. Sci. Comput..
[95] Christian H. Bischof,et al. A framework for symmetric band reduction , 2000, TOMS.
[96] Christian H. Bischof,et al. Algorithm 807: The SBR Toolbox—software for successive band reduction , 2000, TOMS.
[97] Martin H. Gutknecht,et al. Look-Ahead Procedures for Lanczos-Type Product Methods Based on Three-Term Lanczos Recurrences , 2000, SIAM J. Matrix Anal. Appl..
[98] Andrea Pietracaprina,et al. On the Space and Access Complexity of Computation DAGs , 2000, WG.
[99] Zdenek Strakos,et al. Accuracy of Two Three-term and Three Two-term Recurrences for Krylov Space Solvers , 2000, SIAM J. Matrix Anal. Appl..
[100] Edmond Chow,et al. A Priori Sparsity Patterns for Parallel Sparse Approximate Inverse Preconditioners , 1999, SIAM J. Sci. Comput..
[101] Ulrich Rüde,et al. Cache Optimization for Structured and Unstructured Grid Multigrid , 2000 .
[102] Keshav Pingali,et al. Automatic Generation of Block-Recursive Codes , 2000, Euro-Par.
[103] E. Chow. Parallel implementation and practical use of sparse approximate inverses with a priori sparsity patterns , 2001 .
[104] Edmond Chow,et al. Parallel Implementation and Practical Use of Sparse Approximate Inverse Preconditioners with a Priori Sparsity Patterns , 2001, Int. J. High Perform. Comput. Appl..
[105] Larry Carter,et al. Rescheduling for Locality in Sparse Matrix Computations , 2001, International Conference on Computational Science.
[106] Ümit V. Çatalyürek,et al. A fine-grain hypergraph model for 2D decomposition of sparse matrices , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[107] Karen S. Braman,et al. The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance , 2001, SIAM J. Matrix Anal. Appl..
[108] Rudnei Dias da Cunha,et al. New Parallel (Rank-Revealing) QR Factorization Algorithms , 2002, Euro-Par.
[109] Karen S. Braman,et al. The Multishift QR Algorithm. Part II: Aggressive Early Deflation , 2001, SIAM J. Matrix Anal. Appl..
[110] Kesheng Wu,et al. A Block Orthogonalization Procedure with Constant Synchronization Requirements , 2000, SIAM J. Sci. Comput..
[111] A. Tiskin. Bulk-Synchronous Parallel Gaussian Elimination , 2002 .
[112] Richard Vuduc,et al. Automatic performance tuning of sparse matrix kernels , 2003 .
[113] L. Giraud,et al. A Robust Criterion for the Modified Gram-Schmidt Algorithm with Selective Reorthogonalization , 2003 .
[114] Jeremy D. Frens,et al. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.
[115] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .
[116] Viktor K. Prasanna,et al. Optimizing graph algorithms for improved cache performance , 2002, IEEE Transactions on Parallel and Distributed Systems.
[117] Marc Snir,et al. Getting Up to Speed: The Future of Supercomputing , 2004 .
[118] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[119] T. Tao,et al. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities , 2005, math/0505691.
[120] M. Rozložník,et al. The loss of orthogonality in the Gram-Schmidt orthogonalization process , 2005 .
[121] Katherine Yelick,et al. OSKI: A library of automatically tuned sparse matrix kernels , 2005 .
[122] Rajeev Thakur,et al. Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..
[123] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[124] Robert A. van de Geijn,et al. Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.
[125] Telecommunications Board,et al. Getting Up to Speed: The Future of Supercomputing , 2005 .
[126] Gerard L. G. Sleijpen,et al. Reliable updated residuals in hybrid Bi-CG methods , 1996, Computing.
[127] Rob H. Bisseling,et al. Parallel hypergraph partitioning for scientific computing , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[128] Julien Langou,et al. A note on the error analysis of classical Gram–Schmidt , 2006, Numerische Mathematik.
[129] G. Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations , 2006 .
[130] B. Kågström,et al. The Multishift QZ Algorithm with Aggressive Early Deflation , 2006 .
[131] G. Meurant,et al. The Lanczos and conjugate gradient algorithms in finite precision arithmetic , 2006, Acta Numerica.
[132] Volker Strumpen,et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms , 2009, SPAA '06.
[133] Alexander Tiskin. Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..
[134] J. Demmel,et al. Avoiding Communication in Computing Krylov Subspaces , 2007 .
[135] Michael A. Bender,et al. Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.
[136] James Demmel,et al. Fast matrix multiplication is stable , 2006, Numerische Mathematik.
[137] Samuel Williams,et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[138] Keshav Pingali,et al. An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.
[139] Robert A. van de Geijn,et al. Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..
[140] James Demmel,et al. When cache blocking of sparse matrix vector multiply works and why , 2007, Applicable Algebra in Engineering, Communication and Computing.
[141] James Demmel,et al. Fast linear algebra is stable , 2006, Numerische Mathematik.
[142] James Demmel,et al. Cache efficient bidiagonalization using BLAS 2.5 operators , 2008, TOMS.
[143] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[144] James Demmel,et al. Avoiding communication in sparse matrix computations , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[145] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[146] Michael M. Wolf,et al. Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning , 2008 .
[147] James Demmel,et al. Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers , 2008, SIAM J. Sci. Comput..
[148] Bruce Hendrickson,et al. Optimizing parallel sparse matrix-vector multiplication by partitioning. , 2008 .
[149] G. W. Stewart. Block Gram--Schmidt Orthogonalization , 2008, SIAM J. Sci. Comput..
[150] Weiqiang Wang,et al. A Multilevel Parallelization Framework for High-Order Stencil Computations , 2009, Euro-Par.
[151] Julien Langou,et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..
[152] M. Tartibi,et al. 2 Shared Memory Implementation , 2009 .
[153] Tamara G. Kolda,et al. Tensor Decompositions and Applications , 2009, SIAM Rev..
[154] James Demmel,et al. Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[155] Alan LaMielle,et al. Enabling Code Generation within the Sparse Polyhedral Framework , 2010 .
[156] Riko Jacob,et al. The I/O Complexity of Sparse Matrix Dense Matrix Multiplication , 2010, LATIN.
[157] Riko Jacob,et al. Evaluating Non-square Sparse Bilinear Forms on Multiple Vector Pairs in the I/O-Model , 2010, MFCS.
[158] James Demmel,et al. Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.
[159] Jianlin Xia,et al. Fast algorithms for hierarchically semiseparable matrices , 2010, Numer. Linear Algebra Appl..
[160] Daniel Kressner,et al. On Aggressive Early Deflation in Parallel Variants of the QR Algorithm , 2010, PARA.
[161] Mark Hoemmen,et al. Communication-avoiding Krylov subspace methods , 2010 .
[162] James Demmel,et al. Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..
[163] James Demmel,et al. Brief announcement: communication bounds for heterogeneous architectures , 2011, SPAA '11.
[164] Lothar Reichel,et al. On the generation of Krylov subspace bases , 2012 .
[165] James Demmel,et al. CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..
[166] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.
[167] Rob H. Bisseling,et al. Two-dimensional cache-oblivious sparse matrix-vector multiplication , 2011, Parallel Comput..
[168] James Demmel,et al. Communication-Avoiding QR Decomposition for GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[169] Lars Karlsson,et al. Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures , 2011, Parallel Comput..
[170] Gil Shklarski,et al. Partitioned Triangular Tridiagonalization , 2011, TOMS.
[171] Bradley C. Kuszmaul,et al. The Pochoir stencil compiler , 2011, SPAA '11.
[172] Timothy A. Davis,et al. The University of Florida sparse matrix collection , 2011, TOMS.
[173] James Demmel,et al. Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..
[174] Sraban Kumar Mohanty,et al. I/O efficient QR and QZ algorithms , 2012, 2012 19th International Conference on High Performance Computing.
[175] Katherine Yelick,et al. Autotuning Sparse Matrix-Vector Multiplication for Multicore , 2012 .
[176] James Demmel,et al. Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..
[177] Virginia Vassilevska Williams,et al. Multiplying matrices faster than Coppersmith-Winograd , 2012, STOC '12.
[178] James Demmel,et al. Communication avoiding successive band reduction , 2012, PPoPP '12.
[179] James Demmel,et al. Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication , 2012, MedAlg.
[180] James Demmel,et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.
[181] Marghoob Mohiyuddin,et al. Tuning Hardware and Software for Multiprocessors , 2012 .
[182] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).
[183] James Demmel,et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication , 2012, SPAA '12.
[184] J. Demmel,et al. Sequential Communication Bounds for Fast Linear Algebra , 2012 .
[185] Katherine A. Yelick,et al. Communication avoiding and overlapping for numerical linear algebra , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[186] James Demmel,et al. Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[187] James Demmel,et al. Minimizing Communication in All-Pairs Shortest Paths , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[188] James Demmel,et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[189] Grey Ballard,et al. Avoiding Communication in Dense Linear Algebra , 2013 .
[190] Graph expansion and communication costs of fast matrix multiplication , 2012, JACM.
[191] Katherine A. Yelick,et al. A Communication-Optimal N-Body Algorithm for Direct Interactions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[192] James Demmel,et al. Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[193] Benjamin Lipshitz,et al. Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication , 2013 .
[194] Nicholas J. Higham,et al. Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD , 2013, SIAM J. Sci. Comput..
[195] James Demmel,et al. Perfect Strong Scaling Using No Additional Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[196] James Demmel,et al. Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout , 2013, SPAA.
[197] James Demmel,et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[198] James Demmel,et al. Exploiting Data Sparsity in Parallel Matrix Powers Computations , 2013, PPAM.
[199] J. Demmel. An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra , 2013 .
[200] James Demmel,et al. Communication optimal parallel multiplication of sparse random matrices , 2013, SPAA.
[201] James Demmel,et al. Avoiding Communication in Nonsymmetric Lanczos-Based Krylov Subspace Methods , 2013, SIAM J. Sci. Comput..
[202] Riko Jacob,et al. Tight Bounds for Low Dimensional Star Stencils in the External Memory Model , 2012, WADS.
[203] James Demmel,et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.
[204] Piotr Luszczek,et al. An improved parallel singular value algorithm and its implementation for multicore hardware , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[205] Samuel H. Fuller,et al. The Future of Computing Performance: Game Over or Next Level? , 2014 .
[206] J. Demmel,et al. Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations , 2014, SPAA.
[207] Michele Scquizzato,et al. Communication Lower Bounds for Distributed-Memory Computations , 2013, STACS.
[208] J. Demmel,et al. Tradeoffs between synchronization, communication, and work in parallel linear algebra computations , 2014 .
[209] James Demmel,et al. A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of s-Step Krylov Subspace Methods , 2014, SIAM J. Matrix Anal. Appl..
[210] Samuel Williams,et al. s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[211] James Demmel,et al. Reconstructing Householder Vectors from Tall-Skinny QR , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[212] James Demmel,et al. Communication Avoiding Rank Revealing QR Factorization with Column Pivoting , 2015, SIAM J. Matrix Anal. Appl..
[213] James Demmel,et al. Avoiding Communication in Successive Band Reduction , 2015, ACM Trans. Parallel Comput..
[214] W. Jalby,et al. Stability Analysis and Improvement of the Block Gram-Schmidt Algorithm , .