Learning from Optimizing Matrix-Matrix Multiplication

We describe a learning process that uses one of the simplest examples, matrix-matrix multiplication, to illustrate issues that underlie parallel high-performance computing. It is accessible at multiple levels: simple enough to use early in a curriculum, yet rich enough to benefit a more advanced software developer. A carefully designed and scaffolded set of exercises leads the learner from a naive implementation towards one that extracts parallelism at multiple levels, ranging from instruction-level parallelism, to multithreaded parallelism via OpenMP, to distributed-memory parallelism using MPI. Along the way, the exercises expose the importance of effectively leveraging the memory hierarchy within and across nodes, as embodied in the GotoBLAS and SUMMA algorithms. These materials will become part of a Massive Open Online Course (MOOC) to be offered in the future.
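
To make the progression concrete, the sketch below contrasts a naive triple-loop implementation with a cache-blocked variant whose outermost loop is parallelized with OpenMP. This is a minimal illustration under assumed conventions (column-major storage, square n x n matrices, an untuned block size NB), not the course's actual exercise code, which additionally covers vector instruction use, packing, and distributed-memory SUMMA over MPI.

    /*
     * Minimal sketch: naive vs. cache-blocked, OpenMP-parallel GEMM
     * computing C += A * B. Column-major storage, square matrices, and
     * the block size NB are illustrative assumptions.
     * Compile with, e.g., cc -O2 -fopenmp gemm_sketch.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NB 64  /* assumed block size; tuned per architecture in practice */

    /* Naive GEMM: three nested loops, poor cache reuse for large n. */
    static void gemm_naive(int n, const double *A, const double *B, double *C)
    {
        for (int j = 0; j < n; j++)
            for (int p = 0; p < n; p++)
                for (int i = 0; i < n; i++)
                    C[i + j * n] += A[i + p * n] * B[p + j * n];
    }

    /* Blocked GEMM: operate on NB x NB tiles so the working sets of A and
     * B stay in cache. Threads share the loop over column blocks of C, so
     * each thread updates a disjoint set of columns and no race occurs. */
    static void gemm_blocked(int n, const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (int jj = 0; jj < n; jj += NB)
            for (int pp = 0; pp < n; pp += NB)
                for (int ii = 0; ii < n; ii += NB) {
                    int jmax = jj + NB < n ? jj + NB : n;
                    int pmax = pp + NB < n ? pp + NB : n;
                    int imax = ii + NB < n ? ii + NB : n;
                    for (int j = jj; j < jmax; j++)
                        for (int p = pp; p < pmax; p++)
                            for (int i = ii; i < imax; i++)
                                C[i + j * n] += A[i + p * n] * B[p + j * n];
                }
    }

    int main(void)
    {
        int n = 512;
        double *A  = malloc((size_t)n * n * sizeof *A);
        double *B  = malloc((size_t)n * n * sizeof *B);
        double *C0 = calloc((size_t)n * n, sizeof *C0);
        double *C1 = calloc((size_t)n * n, sizeof *C1);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; }
        gemm_naive(n, A, B, C0);
        gemm_blocked(n, A, B, C1);
        /* With A = B = all ones, every entry of C should equal n. */
        printf("naive C[0] = %.0f, blocked C[0] = %.0f (expect %d)\n",
               C0[0], C1[0], n);
        free(A); free(B); free(C0); free(C1);
        return 0;
    }

Both routines perform the same arithmetic; the blocked version simply reorders it so that each tile of A and B is reused many times before being evicted from cache, which is the same principle the GotoBLAS approach refines further.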
