Chapter 9 Communication Avoiding (CA) and Other Innovative Algorithms
[1] Sartaj Sahni,et al. Parallel Matrix and Graph Algorithms , 1981, SIAM J. Comput..
[2] Y. Saad,et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .
[3] H. Walker. Implementation of the GMRES method using Householder transformations , 1988 .
[4] Katherine Yelick,et al. OSKI: A library of automatically tuned sparse matrix kernels , 2005 .
[5] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[6] D. Hu. A Newton Basis GMRES Implementation , 1991 .
[7] James Demmel,et al. Applied Numerical Linear Algebra , 1997 .
[8] Erik Elmroth,et al. Applying recursion to serial and parallel QR factorization leads to better performance , 2000, IBM J. Res. Dev..
[9] James Demmel,et al. CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..
[10] Jocelyne Erhel,et al. A parallel GMRES version for general sparse matrices , 1995 .
[11] James Demmel,et al. Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[12] Jeremy D. Frens,et al. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.
[13] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .
[14] A. George. Nested Dissection of a Regular Finite Element Mesh , 1973 .
[15] D. Rose,et al. Complexity Bounds for Regular Finite Difference and Finite Element Grids , 1973 .
[16] Katherine Yelick,et al. Optimizing collective communication on multicores , 2009 .
[17] M. F.,et al. Bibliography , 1985, Experimental Gerontology.
[18] James Demmel,et al. Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..
[19] James Demmel,et al. Avoiding communication in sparse matrix computations , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[20] James Demmel,et al. LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version , 2012, SIAM J. Matrix Anal. Appl..
[21] James Demmel,et al. Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.
[22] Robert A. van de Geijn,et al. FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.
[23] Christian H. Bischof,et al. A framework for symmetric band reduction , 2000, TOMS.
[24] Karen S. Braman,et al. The Multishift QR Algorithm. Part II: Aggressive Early Deflation , 2001, SIAM J. Matrix Anal. Appl..
[25] James Demmel,et al. Exploiting Data Sparsity in Parallel Matrix Powers Computations , 2013, PPAM.
[26] Charles E. Leiserson,et al. Cache-Oblivious Algorithms , 2003, CIAC.
[27] James Demmel,et al. Minimizing Communication in Linear Algebra , 2009, ArXiv.
[28] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[29] Keshav Pingali,et al. Automatic Generation of Block-Recursive Codes , 2000, Euro-Par.
[30] C. Puglisi. Modification of the Householder method based on the compact WY representation , 1992 .
[31] G. Golub,et al. Parallel block schemes for large-scale least-squares computations , 1988 .
[32] Robert A. van de Geijn,et al. Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.
[33] Fred G. Gustavson,et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..
[34] E. Sturler. A Parallel Variant of GMRES(m) , 1991 .
[35] Alok Aggarwal,et al. The input/output complexity of sorting and related problems , 1988, CACM.
[36] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.
[37] Jack Dongarra,et al. LAPACK Users' Guide , 1992 .
[38] Samuel Williams,et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[39] Raphael Yuster,et al. Fast sparse matrix multiplication , 2004, TALG.
[40] Ronald B. Morgan,et al. Implicitly Restarted GMRES and Arnoldi Methods for Nonsymmetric Systems of Equations , 2000, SIAM J. Matrix Anal. Appl..
[41] C. Loan,et al. A Storage-Efficient $WY$ Representation for Products of Householder Transformations , 1989 .
[42] James Demmel,et al. Perfect Strong Scaling Using No Additional Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[43] Dror Irony,et al. Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers , 2002, Parallel Process. Lett..
[44] James Demmel,et al. Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[45] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting , 1997 .
[46] Richard W. Vuduc,et al. Sparsity: Optimization Framework for Sparse Matrix Kernels , 2004, Int. J. High Perform. Comput. Appl..
[47] Robert H. Halstead,et al. Matrix Computations , 2011, Encyclopedia of Parallel Computing.
[48] Erik Elmroth,et al. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software , 2004, SIAM Rev..
[49] David A. Patterson,et al. Direction-optimizing Breadth-First Search , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[50] J. Demmel,et al. Avoiding Communication in Computing Krylov Subspaces , 2007 .
[51] Xin-She Yang,et al. Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.
[52] James Demmel,et al. A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of s-Step Krylov Subspace Methods , 2014, SIAM J. Matrix Anal. Appl..
[53] Robert A. van de Geijn,et al. PLAPACK: Parallel Linear Algebra Package , 1997, PPSC.
[54] Richard Vuduc,et al. Automatic performance tuning of sparse matrix kernels , 2003 .
[55] James Demmel,et al. Fast linear algebra is stable , 2006, Numerische Mathematik.
[56] Christian H. Bischof,et al. The WY representation for products of Householder matrices , 1985, PPSC.
[57] Fred G. Gustavson,et al. A recursive formulation of Cholesky factorization of a matrix in packed storage , 2001, TOMS.
[58] James Demmel,et al. Nonnegative Diagonals and High Performance on Low-Profile Matrices from Householder QR , 2009, SIAM J. Sci. Comput..
[59] Marc Snir,et al. Getting Up to Speed: The Future of Supercomputing , 2004 .
[60] H. Whitney,et al. An inequality related to the isoperimetric inequality , 1949 .
[61] James Demmel,et al. Avoiding Communication in Two-Sided Krylov Subspace Methods , 2011 .
[62] James Demmel,et al. Communication Avoiding Gaussian elimination , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[63] W. Joubert,et al. Parallelizable restarted iterative methods for nonsymmetric linear systems. part I: Theory , 1992 .
[64] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[65] James Demmel,et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[66] Julien Langou,et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..
[67] James Demmel,et al. Communication avoiding successive band reduction , 2012, PPoPP '12.
[68] Mark Hoemmen,et al. A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[69] Vipin Kumar,et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..
[70] Samuel H. Fuller,et al. The Future of Computing Performance: Game Over or Next Level? , 2014 .
[71] Y. Saad,et al. Communication complexity of the Gaussian elimination algorithm on multiprocessors , 1986 .
[72] J. Cullum,et al. A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices , 1974, CDC 1974.
[73] Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms , 1993 .
[74] Vijaya Ramachandran,et al. Cache-oblivious dynamic programming , 2006, SODA '06.
[75] Sivan Toledo,et al. Quantitative performance modeling of scientific computations and creating locality in numerical algorithms , 1995 .
[76] Samuel Williams,et al. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[77] James Demmel,et al. Communication-Avoiding Krylov Techniques on Graphic Processing Units , 2013, IEEE Transactions on Magnetics.
[78] James Demmel,et al. Avoiding Communication in Successive Band Reduction , 2015, ACM Trans. Parallel Comput..
[79] Michael A. Bender,et al. Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.
[80] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.
[81] Karen S. Braman,et al. The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance , 2001, SIAM J. Matrix Anal. Appl..
[82] James Demmel,et al. Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[83] Christian H. Bischof,et al. Algorithm 807: The SBR Toolbox—software for successive band reduction , 2000, TOMS.
[84] D. Patterson,et al. Searching for a Parent Instead of Fighting Over Children: A Fast Breadth-First Search Implementation for Graph 500 , 2011 .
[85] R. Tarjan,et al. The analysis of a nested dissection algorithm , 1987 .
[86] Mark Hoemmen,et al. Communication-avoiding Krylov subspace methods , 2010 .
[87] Erik Elmroth,et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems , 1998, PARA.
[88] Katherine A. Yelick,et al. A Communication-Optimal N-Body Algorithm for Direct Interactions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[89] J. Bunch,et al. Some stable methods for calculating inertia and solving symmetric linear systems , 1977 .
[90] Viktor K. Prasanna,et al. Optimizing graph algorithms for improved cache performance , 2002, IEEE Transactions on Parallel and Distributed Systems.
[91] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..
[92] James Demmel,et al. Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..
[93] Lynn Elliot Cannon,et al. A cellular computer to implement the Kalman filter algorithm , 1969 .
[94] James Demmel,et al. Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout , 2013, SPAA.
[95] W. Marsden. I and J , 2012 .
[96] James Demmel,et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.
[97] Alexander Tiskin,et al. Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.
[98] Eric de Sturler,et al. Recycling Krylov Subspaces for Sequences of Linear Systems , 2006, SIAM J. Sci. Comput..
[99] D. O’Leary. The block conjugate gradient algorithm and related methods , 1980 .