Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
Torsten Hoefler | Joost VandeVondele | Maciej Besta | Raffaele Solcà | Grzegorz Kwasniewski | Marko Kabic
[1] Ramesh C. Agarwal, et al. A three-dimensional approach to parallel matrix multiplication, 1995, IBM J. Res. Dev.
[2] V. Strassen. Gaussian elimination is not optimal, 1969.
[3] Robert A. van de Geijn, et al. Pushing the Bounds for Matrix-Matrix Multiplication, 2017, ArXiv.
[4] Torsten Hoefler, et al. Enabling highly-scalable remote memory access programming with MPI-3 one sided, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[5] John Cocke, et al. Register Allocation Via Coloring, 1981, Comput. Lang.
[6] John R. Gilbert, et al. Parallel Triangle Counting and Enumeration Using Matrix Algebra, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.
[7] Michael Wolfe, et al. More iteration space tiling, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[8] Jaeyoung Choi, et al. Pumma: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers, 1994, Concurr. Pract. Exp.
[9] Peter Messmer, et al. Enabling simulation at the fifth rung of DFT: Large scale RPA calculations with excellent time to solution, 2015, Comput. Phys. Commun.
[10] Françoise Chatelin. Eigenvalues of Matrices: Revised Edition, 2012.
[11] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.
[12] Michael A. Bender, et al. Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model, 2007, SPAA '07.
[13] Torsten Hoefler, et al. SlimSell: A Vectorizable Graph Representation for Breadth-First Search, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[14] Siu Man Chan. Just a Pebble Game, 2013, 2013 IEEE Conference on Computational Complexity.
[15] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.
[16] Torsten Hoefler, et al. Remote Memory Access Programming in MPI-3, 2015, TOPC.
[17] Jaeyoung Choi, et al. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1994, Sci. Program.
[18] Oded Schwartz, et al. Matrix Multiplication I/O-Complexity by Path Routing, 2015, SPAA.
[19] Jeffrey S. Vetter, et al. Statistical scalability analysis of communication operations in distributed applications, 2001, PPoPP '01.
[20] Torsten Hoefler, et al. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication, 2016, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[21] Hyuk-Jae Lee, et al. Generalized Cannon's algorithm for parallel matrix multiplication, 1997, ICS '97.
[22] Lynn Elliot Cannon, et al. A cellular computer to implement the kalman filter algorithm, 1969.
[23] Alain Darte. On the Complexity of Loop Fusion, 2000, Parallel Comput.
[24] Alexandru Nicolau, et al. Using Recursion to Boost ATLAS's Performance, 2005, ISHPC.
[25] James Demmel, et al. Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations, 2016.
[26] John D. Lafferty, et al. Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent, 2016, ArXiv.
[27] James Demmel, et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[28] M. A. Bender. Optimal sparse matrix dense vector multiplication in the I/O-model, 2006.
[29] Ravi Sethi, et al. Complete register allocation problems, 1973, SIAM J. Comput.
[30] Vipin Kumar, et al. Scalability of Parallel Algorithms for Matrix Multiplication, 1993, 1993 International Conference on Parallel Processing - ICPP'93.
[31] Michael I. Jordan, et al. On Spectral Clustering: Analysis and an algorithm, 2001, NIPS.
[32] Nicholas J. Higham, et al. Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[33] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.
[34] John Cocke, et al. A methodology for the real world, 1981.
[35] James Demmel, et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[36] Dingwen Tao, et al. TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs, 2019, ICS.
[37] James Demmel, et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, 1995, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.
[38] Sraban Kumar Mohanty. I/O Efficient Algorithms for Matrix Computations, 2010, ArXiv.
[39] James Demmel, et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms, 2011, Euro-Par.
[40] Jack J. Dongarra, et al. Efficient implementation of quantum materials simulations on distributed CPU-GPU systems, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[41] Quanquan C. Liu. Red-blue and standard pebble games: complexity and applications in the sequential and parallel models, 2017.
[42] Franz Franchetti, et al. Mathematical foundations of the GraphBLAS, 2016, 2016 IEEE High Performance Extreme Computing Conference (HPEC).
[43] G. C. Fox, et al. Solving Problems on Concurrent Processors, 1988.
[44] Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.
[45] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.
[46] Xin-She Yang, et al. Introduction to Algorithms, 2021, Nature-Inspired Optimization Algorithms.
[47] Alfio Lazzaro, et al. Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI, 2017, PASC.
[48] Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.
[49] Jack Dongarra, et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers, 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.
[50] H. Whitney, et al. An inequality related to the isoperimetric inequality, 1949.
[51] Carl D. Meyer, et al. Matrix Analysis and Applied Linear Algebra, 2000.
[52] Gero Greiner, et al. Sparse Matrix Computations and their I/O Complexity, 2012.
[53] James Demmel, et al. Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations, 2014, SPAA.
[54] Matteo Frigo, et al. An analysis of dag-consistent distributed shared-memory algorithms, 1996, SPAA '96.
[55] Torsten Hoefler, et al. Demystifying Parallel and Distributed Deep Learning, 2018, ACM Comput. Surv.
[56] Sivan Toledo, et al. A survey of out-of-core algorithms in numerical linear algebra, 1999, External Memory Algorithms.
[57] Robert E. Tarjan, et al. The pebbling problem is complete in polynomial space, 1979, SIAM J. Comput.
[58] Leonid Oliker, et al. Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[59] Geoffrey C. Fox, et al. Matrix algorithms on a hypercube I: Matrix multiplication, 1987, Parallel Comput.
[60] Ken Kennedy, et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution, 1993, LCPC.