BLISlab: A Sandbox for Optimizing GEMM

Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course, yet in practice its implementation on computers remains important enough to be an active research topic. This note describes a set of exercises that use this operation to illustrate how high performance can be attained on modern CPUs with hierarchical memories (multiple levels of cache). It does so by exposing a simplified "sandbox" that mimics the implementation in the BLAS-like Library Instantiation Software (BLIS) framework, building on the insights that underlie that framework. As such, it also becomes a vehicle for "crowdsourcing" the optimization of BLIS. We call this set of exercises BLISlab.
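
For concreteness, the operation in question is C := A B + C, where C is m x n, A is m x k, and B is k x n. A minimal, unoptimized C implementation of the kind such exercises typically start from might look like the following sketch (the function and parameter names here are illustrative, not the actual BLISlab interface; column-major storage with explicit leading dimensions is assumed, as in BLIS):

    /* Naive triple-loop sketch of C := A * B + C.
     * C is m x n, A is m x k, B is k x n, all stored column-major
     * with leading dimensions ldc, lda, ldb. Names are illustrative,
     * not the actual BLISlab interface. */
    void dgemm_naive( int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double       *C, int ldc )
    {
        for ( int j = 0; j < n; j++ )             /* loop over columns of C */
            for ( int p = 0; p < k; p++ )         /* loop over the inner (k) dimension */
                for ( int i = 0; i < m; i++ )     /* loop over rows of C */
                    C[ i + j * ldc ] += A[ i + p * lda ] * B[ p + j * ldb ];
    }

Restructuring these three loops, blocking them for the levels of cache, and packing submatrices of A and B is, in essence, what the exercises walk through.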
