Learning from Optimizing Matrix-Matrix Multiplication
Robert A. van de Geijn | Jianyu Huang | Devangi N. Parikh | Margaret E. Myers
[1] Martin D. Schatz,et al. Parallel Matrix Multiplication: A Systematic Journey , 2016, SIAM J. Sci. Comput..
[2] Jianyu Huang,et al. Performance optimization for the k-nearest neighbors kernel on x86 architectures , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Fillia Makedon,et al. Teaching Parallel Computing to Freshmen , 1994 .
[4] Tze Meng Low,et al. The BLIS Framework , 2016 .
[5] Robert A. van de Geijn,et al. BLISlab: A Sandbox for Optimizing GEMM , 2016, ArXiv.
[6] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.
[7] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[8] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..
[9] Robert A. van de Geijn,et al. Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..
[10] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991 .
[11] Robert A. van de Geijn,et al. Generating Families of Practical Fast Matrix Multiplication Algorithms , 2016, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[12] Tze Meng Low,et al. Analytical Modeling Is Enough for High-Performance BLIS , 2016, ACM Trans. Math. Softw..
[13] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..
[14] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .
[15] Field G. Van Zee,et al. Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods , 2017, ACM Trans. Math. Softw..
[16] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[17] Paolo Bientinesi,et al. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication , 2016, ACM Trans. Math. Softw..
[18] Christopher H. Nevison. Parallel Computing in the Undergraduate Curriculum , 1995, Computer.
[19] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[20] Robert A. van de Geijn,et al. Strassen's Algorithm for Tensor Contraction , 2017, SIAM J. Sci. Comput..
[21] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[22] Robert A. van de Geijn,et al. Using PLAPACK - parallel linear algebra package , 1997 .
[23] Jack J. Dongarra,et al. An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.
[24] Franz Franchetti,et al. How to Write Fast Numerical Code: A Small Introduction , 2007, GTTSE.
[25] Robert A. van de Geijn,et al. Anatomy of High-Performance Many-Threaded Matrix Multiplication , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[26] Devin A. Matthews,et al. High-Performance Tensor Contraction without Transposition , 2016, SIAM J. Sci. Comput..
[27] Marsha Meredith. Introducing parallel computing into the undergraduate computer science curriculum: a progress report , 1992, SIGCSE '92.
[28] Jie Wu,et al. NSF/IEEE-TCPP curriculum initiative on parallel and distributed computing: core topics for undergraduates , 2011, SIGCSE '11.
[29] Robert A. van de Geijn,et al. Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.
[30] Robert A. van de Geijn,et al. Strassen's Algorithm Reloaded , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.