Avoiding Communication in Dense Linear Algebra
[1] K. Murata,et al. A New Method for the Tridiagonalization of the Symmetric Band Matrix , 1975 .
[2] James Demmel,et al. Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[3] James Demmel,et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.
[4] Michael A. Heroux,et al. GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm , 1994, Journal of Computational Physics.
[5] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.
[6] James Demmel,et al. Fast linear algebra is stable , 2006, Numerische Mathematik.
[7] James Demmel,et al. Avoiding Communication in Successive Band Reduction , 2015, ACM Trans. Parallel Comput..
[8] James Demmel,et al. Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..
[9] James Demmel,et al. Fast matrix multiplication is stable , 2006, Numerische Mathematik.
[10] Stephen Warshall,et al. A Theorem on Boolean Matrices , 1962, JACM.
[11] James Demmel,et al. IEEE Standard for Floating-Point Arithmetic , 2008 .
[12] James Demmel,et al. LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version , 2012, SIAM J. Matrix Anal. Appl..
[13] Ramesh Subramonian,et al. LogP: a practical model of parallel computation , 1996, CACM.
[14] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..
[15] Robert A. van de Geijn,et al. Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.
[16] Christian H. Bischof,et al. The WY representation for products of Householder matrices , 1985, PPSC.
[17] Alexander Tiskin,et al. Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.
[18] G. Golub,et al. Parallel block schemes for large-scale least-squares computations , 1988 .
[19] Nicholas J. Higham,et al. INVERSE PROBLEMS NEWSLETTER , 1991 .
[20] Nicholas J. Higham,et al. Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD , 2013, SIAM J. Sci. Comput..
[21] Lars Karlsson,et al. Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures , 2011, Parallel Comput..
[22] James Demmel,et al. Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.
[23] J. Demmel,et al. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems , 1997 .
[24] Robert B. Wilhelmson. High-speed computing: scientific applications and algorithm design , 1988 .
[25] James Demmel,et al. Communication-Avoiding QR Decomposition for GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[26] Christian H. Bischof,et al. Parallel Bandreduction and Tridiagonalization , 1993, PPSC.
[27] Jack Dongarra,et al. Experiments with Strassen's Algorithm: From Sequential to Parallel , 2006 .
[28] Fred G. Gustavson,et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..
[29] Yi Ma,et al. Robust principal component analysis? , 2009, JACM.
[30] Alok Aggarwal,et al. The input/output complexity of sorting and related problems , 1988, CACM.
[31] Alexander Tiskin. Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..
[32] Ran Raz. On the Complexity of Matrix Product , 2003, SIAM J. Comput..
[33] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[34] Grazia Lotti,et al. O(n^2.7799) Complexity for n × n Approximate Matrix Multiplication , 1979, Inf. Process. Lett..
[35] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[36] B. S. Garbow,et al. Matrix Eigensystem Routines — EISPACK Guide , 1974, Lecture Notes in Computer Science.
[37] Robert A. van de Geijn,et al. FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.
[38] James Hardy Wilkinson,et al. Reduction of the symmetric eigenproblem Ax=λBx and related problems to standard form , 1968 .
[39] Robert A. van de Geijn,et al. A High Performance Parallel Strassen Implementation , 1995, Parallel Process. Lett..
[40] J. Demmel. An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra , 2013 .
[41] Bruno Lang,et al. A Parallel Algorithm for Reducing Symmetric Banded Matrices to Tridiagonal Form , 1993, SIAM J. Sci. Comput..
[42] Greg Henry,et al. Application of a High Performance Parallel Eigensolver to Electronic Structure Calculations , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[43] Robert A. van de Geijn,et al. Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.
[44] James Demmel,et al. Communication Avoiding Rank Revealing QR Factorization with Column Pivoting , 2015, SIAM J. Matrix Anal. Appl..
[45] Jack J. Dongarra,et al. Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[46] Marc Snir,et al. Getting Up to Speed: The Future of Supercomputing , 2004 .
[47] P. Sadayappan,et al. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction , 1993, [1993] Proceedings Seventh International Parallel Processing Symposium.
[48] Christian H. Bischof,et al. Algorithm 807: The SBR Toolbox—software for successive band reduction , 2000, TOMS.
[49] J. Bunch,et al. Some stable methods for calculating inertia and solving symmetric linear systems , 1977 .
[50] C. Puglisi. Modification of the Householder method based on the compact WY representation , 1992 .
[51] Thomas Auckenthaler,et al. Highly scalable eigensolvers for petaflop applications , 2012 .
[52] L. Trefethen,et al. Average-case stability of Gaussian elimination , 1990 .
[53] C. Loan,et al. A Storage-Efficient $WY$ Representation for Products of Householder Transformations , 1989 .
[54] Robert A. van de Geijn,et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995 .
[55] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[56] Shmuel Winograd,et al. On multiplication of 2 × 2 matrices , 1971 .
[57] James Demmel,et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication , 2012, SPAA '12.
[58] Jack J. Dongarra,et al. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[59] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..
[60] Guy E. Blelloch,et al. Provably good multicore cache performance for divide-and-conquer algorithms , 2008, SODA '08.
[61] R. Tarjan,et al. The analysis of a nested dissection algorithm , 1987 .
[62] James Demmel,et al. Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[63] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[64] A. Tiskin. Bulk-Synchronous Parallel Gaussian Elimination , 2002 .
[65] H. Schwarz. Tridiagonalization of a symmetric band matrix , 1968 .
[66] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science.
[67] Thomas Rauber,et al. Combining building blocks for parallel multi-level matrix multiplication , 2008, Parallel Comput..
[68] D. Rose,et al. Complexity Bounds for Regular Finite Difference and Finite Element Grids , 1973 .
[69] Alok Aggarwal,et al. Communication Complexity of PRAMs , 1990, Theor. Comput. Sci..
[70] James Demmel,et al. LAPACK Users' Guide, Third Edition , 1999, Software, Environments and Tools.
[71] Jarle Berntsen,et al. Communication efficient matrix multiplication on hypercubes , 1989, Parallel Comput..
[72] James Demmel,et al. Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[73] Leslie G. Valiant,et al. A bridging model for multi-core computing , 2008, J. Comput. Syst. Sci..
[74] N. Higham. Notes on Accuracy and Stability of Algorithms in Numerical Linear Algebra , 1999 .
[75] V. Strassen. Gaussian elimination is not optimal , 1969 .
[76] G. Miller. On the Solution of a System of Linear Equations , 1910 .
[77] Sraban Kumar Mohanty,et al. I/O efficient QR and QZ algorithms , 2012, 2012 19th International Conference on High Performance Computing.
[78] James Demmel,et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[79] James Demmel,et al. Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout , 2013, SPAA.
[80] James Demmel,et al. CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..
[81] Erik Elmroth,et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems , 1998, PARA.
[82] James Demmel,et al. Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication , 2012, MedAlg.
[83] James Demmel,et al. Communication avoiding successive band reduction , 2012, PPoPP '12.
[84] James Demmel,et al. Applied Numerical Linear Algebra , 1997 .
[85] A. George. Nested Dissection of a Regular Finite Element Mesh , 1973 .
[86] Lukas Krämer,et al. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations , 2011, Parallel Comput..
[87] Julien Langou,et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..
[88] F. V. Zee. Restructuring the QR Algorithm for Performance , 2011 .
[89] Viktor K. Prasanna,et al. Optimizing graph algorithms for improved cache performance , 2004, Proceedings 16th International Parallel and Distributed Processing Symposium.
[90] Jack Dongarra,et al. Computational Science: Ensuring America's Competitiveness , 2005 .
[91] James Demmel,et al. Communication optimal parallel multiplication of sparse random matrices , 2013, SPAA.
[92] Linda Kaufman,et al. The retraction algorithm for factoring banded symmetric matrices , 2007, Numer. Linear Algebra Appl..
[93] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[94] Frédéric Suter,et al. Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms , 2004 .
[95] Shang-Hua Teng,et al. Smoothed Analysis of the Condition Numbers and Growth Factors of Matrices , 2003, SIAM J. Matrix Anal. Appl..
[96] Mark Ainsworth,et al. The graduate student's guide to numerical analysis '98 : lecture notes from the VIII EPSRC Summer School in Numerical Analysis , 1999 .
[97] Mohammad Zubair,et al. A unified model for multicore architectures , 2008, IFMT '08.
[98] Gil Shklarski,et al. Partitioned Triangular Tridiagonalization , 2011, TOMS.
[99] J. O. Aasen. On the reduction of a symmetric matrix to tridiagonal form , 1971 .
[100] Christian H. Bischof,et al. A framework for symmetric band reduction , 2000, TOMS.
[101] Joseph F. Traub. Complexity of Sequential and Parallel Numerical Algorithms , 1973 .
[102] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[103] Butler W. Lampson,et al. Annual Review of Computer Science , 1986 .
[104] Linda Kaufman. Band reduction algorithms revisited , 2000, TOMS.
[105] Jeremy D. Frens,et al. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.
[106] D. Eppstein,et al. Parallel Algorithmic Techniques for Combinatorial Computation , 1988 .
[107] Ramesh C. Agarwal,et al. A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication , 1994, IBM J. Res. Dev..
[108] T. H Axford. Annual review of computer science. Volume 4, 1989–1990 , 1991 .
[109] Daniel Kressner,et al. Algorithm 953 , 2015 .
[110] Keshav Pingali,et al. An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.
[111] C. H. Bischof,et al. A framework for symmetric band reduction and tridiagonalization , 1994 .
[112] James Demmel,et al. Minimizing Communication in All-Pairs Shortest Paths , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[113] Sivan Toledo,et al. THE SNAP-BACK PIVOTING METHOD FOR SYMMETRIC , 2006 .
[114] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..
[115] James Demmel,et al. Graph expansion and communication costs of fast matrix multiplication , 2011, SPAA '11.
[116] Linda Kaufman,et al. Banded Eigenvalue Solvers on Vector Machines , 1984, TOMS.
[117] Qingshan Luo,et al. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers , 1995, SAC '95.
[118] James Demmel,et al. Communication-Avoiding Symmetric-Indefinite Factorization , 2014, SIAM J. Matrix Anal. Appl..
[119] Benjamin Lipshitz,et al. Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication , 2013 .
[120] Keshav Pingali,et al. Automatic Generation of Block-Recursive Codes , 2000, Euro-Par.
[121] James Demmel,et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.
[122] James Hardy Wilkinson,et al. Householder's method for symmetric matrices , 1962 .
[123] Virginia Vassilevska Williams,et al. Multiplying matrices faster than Coppersmith-Winograd , 2012, STOC '12.
[124] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.
[125] David S. Wise. Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.
[126] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[127] Lynn Elliot Cannon,et al. A cellular computer to implement the Kalman filter algorithm , 1969 .
[128] James Demmel,et al. Brief announcement: communication bounds for heterogeneous architectures , 2011, SPAA '11.
[129] James Demmel,et al. Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..
[130] H. Whitney,et al. An inequality related to the isoperimetric inequality , 1949 .
[131] Xiaobai Sun,et al. Parallel tridiagonalization through two-step band reduction , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.
[132] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .
[133] Lukas Krämer,et al. Developing algorithms and software for the parallel solution of the symmetric eigenvalue problem , 2011, J. Comput. Sci..
[134] Katherine A. Yelick,et al. Communication avoiding and overlapping for numerical linear algebra , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[135] James Demmel,et al. Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers , 2008, SIAM J. Sci. Comput..
[136] Robert A. van de Geijn,et al. Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..
[137] James Demmel,et al. Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..
[138] H. Rutishauser. On Jacobi rotation patterns , 1963 .
[139] Barton P. Miller,et al. Critical path analysis for the execution of parallel and distributed programs , 1988, [1988] Proceedings. The 8th International Conference on Distributed.
[140] L. R. Kerr,et al. On Minimizing the Number of Multiplications Necessary for Matrix Multiplication , 1969 .
[141] J. Demmel,et al. Sequential Communication Bounds for Fast Linear Algebra , 2012 .
[142] Jack Dongarra,et al. ScaLAPACK Users' Guide , 1997 .
[143] T. Tao,et al. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities , 2005, math/0505691.
[144] Sivasankaran Rajamanickam,et al. EFFICIENT ALGORITHMS FOR SPARSE SINGULAR VALUE DECOMPOSITION , 2009 .
[145] Gianfranco Bilardi,et al. A Lower Bound Technique for Communication on BSP with Application to the FFT , 2012, Euro-Par.