Communication-Avoiding Parallel Strassen: Implementation and performance
James Demmel | Oded Schwartz | Grey Ballard | Benjamin Lipshitz
[1] Hans Werner Meuer,et al. Top500 Supercomputer Sites , 1997 .
[2] James Demmel,et al. Improving communication performance in dense linear algebra via topology aware collectives , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[3] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science.
[4] James Demmel,et al. Graph expansion and communication costs of fast matrix multiplication: regular submission , 2011, SPAA '11.
[5] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..
[6] James Demmel,et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.
[7] James Demmel,et al. Fast linear algebra is stable , 2006, Numerische Mathematik.
[8] John Shalf,et al. Exascale Computing Technology Challenges , 2010, VECPAR.
[9] Jarle Berntsen,et al. Communication efficient matrix multiplication on hypercubes , 1989, Parallel Comput..
[10] Lynn Elliot Cannon,et al. A cellular computer to implement the Kalman filter algorithm , 1969 .
[11] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.
[12] Qingshan Luo,et al. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers , 1995, SAC '95.
[13] James Demmel,et al. Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..
[14] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[15] James Demmel,et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication , 2012, SPAA '12.
[16] Guy E. Blelloch,et al. Effectively sharing a cache among threads , 2004, SPAA '04.
[17] David S. Wise,et al. Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms , 2006, MSPC '06.
[18] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..
[19] Alexander Tiskin,et al. Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.
[20] Mei Han An,et al. Accuracy and stability of numerical algorithms , 1991 .
[21] Robert A. van de Geijn,et al. A High Performance Parallel Strassen Implementation , 1995, Parallel Process. Lett..