BLIS: A Modern Alternative to the BLAS

We propose the portable BLAS-like Interface Software (BLIS) framework, which addresses a number of shortcomings in both the original BLAS interface and present-day BLAS implementations. The framework allows developers to rapidly instantiate high-performance BLAS-like libraries on existing and new architectures with relatively little effort. The key to this achievement is the observation that virtually all computation within level-2 and level-3 BLAS operations may be expressed in terms of very simple kernels. Higher-level framework code is generalized so that it can be reused and/or re-parameterized for different operations (as well as different architectures) with little to no modification. Inserting high-performance kernels into the framework immediately optimizes every BLAS-like operation that is cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are supported through a straightforward compatibility layer, though those who wish to access new functionality must update their calling sequences. Experimental performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
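The idea that higher-level operations can be cast in terms of a very simple kernel can be illustrated with a small sketch. The code below is not BLIS code and does not use the BLIS API: the names `micro_kernel`, `gemm`, and `syrk`, the tile sizes `MR` and `NR`, and the packing layout are all assumptions made for illustration. It only shows the structure the abstract describes, in which generic blocked loops are reused across operations and all arithmetic is delegated to one small kernel.

```python
# Hypothetical illustration only: none of these names belong to the BLIS API.
MR, NR = 2, 2  # assumed micro-tile dimensions (rows x columns)


def micro_kernel(k, a_panel, b_panel, c, i0, j0, m, n):
    """Update the m x n tile of c whose top-left corner is (i0, j0).

    a_panel holds k packed columns of length m, b_panel holds k packed
    rows of length n. The tile is updated as a sum of k rank-1 products.
    """
    for p in range(k):
        a_col = a_panel[p]
        b_row = b_panel[p]
        for i in range(m):
            for j in range(n):
                c[i0 + i][j0 + j] += a_col[i] * b_row[j]


def gemm(a, b, c):
    """Compute C += A * B; every multiply-add happens inside micro_kernel."""
    m, k, n = len(a), len(b), len(b[0])
    for i0 in range(0, m, MR):
        mb = min(MR, m - i0)  # edge tiles may be smaller than MR
        # Pack rows i0..i0+mb-1 of A as k short columns.
        a_panel = [[a[i0 + i][p] for i in range(mb)] for p in range(k)]
        for j0 in range(0, n, NR):
            nb = min(NR, n - j0)  # edge tiles may be smaller than NR
            # Pack columns j0..j0+nb-1 of B as k short rows.
            b_panel = [[b[p][j0 + j] for j in range(nb)] for p in range(k)]
            micro_kernel(k, a_panel, b_panel, c, i0, j0, mb, nb)
    return c


def syrk(a, c):
    """Compute C += A * A^T by reusing the gemm loops and kernel unchanged.

    Simplification: a real symmetric rank-k update would touch only one
    triangle of C; this sketch updates the full matrix.
    """
    a_t = [list(col) for col in zip(*a)]
    return gemm(a, a_t, c)
```

In this sketch, replacing `micro_kernel` with a faster implementation would speed up both `gemm` and `syrk` without touching their loop code, which is the productivity argument made above in miniature.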
