The BLAS API of BLASFEO: optimizing performance for small matrices

BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an API which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user, which removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary. This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications that employ standard BLAS and LAPACK libraries on matrices of moderate size. In particular, this paper investigates the benefits in scientific programming languages such as Octave, SciPy and Julia.
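To make the two central ideas of the abstract concrete, the following Python sketch illustrates (a) packing a column-major matrix into a panel-major layout and (b) a matrix-multiply entry point that switches between an unpacked and a packed algorithm depending on matrix size. It is an illustrative sketch, not BLASFEO code: the function names, the panel height `PS = 4`, and the threshold `SMALL_THRESHOLD = 8` are assumptions chosen for readability, and the pure-Python loops stand in for the optimized assembly kernels the paper describes.

```python
PS = 4               # assumed panel height (rows per panel), for illustration only
SMALL_THRESHOLD = 8  # hypothetical size below which packing is skipped


def pack_panel_major(a, m, n, ps=PS):
    """Pack an m x n column-major matrix (flat list) into panel-major order.

    Rows are grouped into panels of `ps` rows; inside a panel the elements
    are stored column by column. The last panel is zero-padded.
    """
    num_panels = (m + ps - 1) // ps
    packed = [0.0] * (num_panels * ps * n)
    for j in range(n):
        for i in range(m):
            packed[(i // ps) * ps * n + j * ps + (i % ps)] = a[i + j * m]
    return packed


def gemm_unpacked(a, b, m, n, k):
    """C = A * B computed directly on column-major data (no packing)."""
    c = [0.0] * (m * n)
    for j in range(n):
        for l in range(k):
            b_lj = b[l + j * k]
            for i in range(m):
                c[i + j * m] += a[i + l * m] * b_lj
    return c


def gemm_packed(a, b, m, n, k, ps=PS):
    """C = A * B with A packed to panel-major; B and C stay column-major."""
    a_pm = pack_panel_major(a, m, k, ps)
    c = [0.0] * (m * n)
    for i0 in range(0, m, ps):
        rows = min(ps, m - i0)           # rows actually present in this panel
        panel = (i0 // ps) * ps * k      # offset of the panel in the packed buffer
        for j in range(n):
            for l in range(k):
                b_lj = b[l + j * k]
                for r in range(rows):
                    c[i0 + r + j * m] += a_pm[panel + l * ps + r] * b_lj
    return c


def gemm(a, b, m, n, k):
    """Dispatch on size: skip packing for small problems, pack otherwise."""
    if max(m, n, k) <= SMALL_THRESHOLD:
        return gemm_unpacked(a, b, m, n, k)
    return gemm_packed(a, b, m, n, k)
```

The design point being illustrated is that packing costs time proportional to the matrix size, so a routine behind a standard column-major BLAS interface may prefer to skip it when the operands are small and pay for it only when the packed layout is expected to amortize the cost. The actual switching criteria and kernels used in the paper are not reproduced here.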
