Paper Information - BLIS: A Modern Alternative to the BLAS

We propose the portable BLAS-like Interface Software (BLIS) framework which addresses a number of shortcomings in both the original BLAS interface and present-day BLAS implementations. The framework allows developers to rapidly instantiate high-performance BLAS-like libraries on existing and new architectures with relatively little effort. The key to this achievement is the observation that virtually all computation within level-2 and level-3 BLAS operations may be expressed in terms of very simple kernels. Higher-level framework code is generalized so that it can be reused and/or re-parameterized for different operations (as well as different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any and all BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are supported through a straightforward compatibility layer, though calling sequences must be updated for those who wish to access new functionality. Experimental performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
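The central observation in the abstract, that an operation can be written as generic framework loops around one very simple kernel, can be illustrated with a small sketch. This is not the BLIS API: the names `MR`, `NR`, `micro_kernel`, and `gemm`, and the tile sizes, are hypothetical and chosen only for illustration. The only arithmetic happens in `micro_kernel`; the surrounding loops merely partition the operands.

```python
# Illustrative sketch only. MR, NR, micro_kernel and gemm are hypothetical
# names and do not correspond to the actual BLIS interfaces.

MR, NR = 2, 2  # micro-tile dimensions (assumed values for illustration)


def micro_kernel(k, a_panel, b_panel, c, i0, j0, m_r, n_r):
    """Update the m_r x n_r tile of C at (i0, j0) with k rank-1 updates."""
    for p in range(k):
        for i in range(m_r):
            a_ip = a_panel[i][p]
            for j in range(n_r):
                c[i0 + i][j0 + j] += a_ip * b_panel[p][j]


def gemm(a, b, c):
    """Compute C += A * B; all floating-point work is delegated to micro_kernel."""
    m, k, n = len(a), len(b), len(b[0])
    for j0 in range(0, n, NR):
        n_r = min(NR, n - j0)                       # handle a narrow last panel
        b_panel = [row[j0:j0 + n_r] for row in b]   # k x n_r slice of B
        for i0 in range(0, m, MR):
            m_r = min(MR, m - i0)                   # handle a short last panel
            a_panel = a[i0:i0 + m_r]                # m_r x k slice of A
            micro_kernel(k, a_panel, b_panel, c, i0, j0, m_r, n_r)
    return c
```

In this sketch, replacing `micro_kernel` with a faster implementation would speed up every operation built on the same loops, which is the "productivity multiplier" effect the abstract describes. A real implementation would also pack the panels into contiguous buffers and block for the cache hierarchy; both are omitted here.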

Field G. Van Zee | Robert A. van de Geijn
