Auto-tuning a Matrix Routine for High Performance

Well-written scientific simulations typically achieve large performance gains by building on highly optimized library routines. Among the most fundamental of these are the routines for matrix-matrix multiplication and related operations, known collectively as BLAS (Basic Linear Algebra Subprograms). Optimizing these routines is therefore of great importance for many scientific simulations; indeed, they are often hand-optimized in assembly language for a given processor to obtain the best possible performance. In this paper, we present a new tuning approach that combines a small snippet of assembly code with an auto-tuner. For our preliminary test case, the symmetric rank-2 update, the resulting routine outperforms the best auto-tuned and vendor-supplied code on our target machine, an Intel quad-core processor, and is less than 1.2% slower than the best hand-coded library. Our approach shows considerable promise for further performance gains on modern multi-core and many-core processors.
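
As a rough illustration of the idea (this is not the authors' code), the sketch below pairs a plain-C symmetric rank-2 update kernel with a toy auto-tuning loop that times a few candidate blocking factors and keeps the fastest. In the paper the inner kernel is a hand-written assembly snippet and the search space covers its real tuning parameters; here the kernel, the strip-mining parameter `bs`, and the candidate list are all stand-ins chosen purely for illustration.

```c
/* Illustrative only: a symmetric rank-2 update on the lower triangle of A
 * (column-major), A := A + alpha*(x*y^T + y*x^T), whose row loop is
 * strip-mined by a tunable factor 'bs', plus a toy search over a few
 * candidate values of 'bs'.  The routine described in the paper replaces
 * the inner strip with a hand-written assembly micro-kernel. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void syr2_blocked(int n, int bs, double alpha, const double *x,
                         const double *y, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        for (int ib = j; ib < n; ib += bs) {
            int iend = (ib + bs < n) ? ib + bs : n;
            /* This strip is what an assembly micro-kernel would implement
             * with the vector registers of the target CPU. */
            for (int i = ib; i < iend; ++i)
                A[i + (size_t)j * lda] += alpha * (x[i] * y[j] + y[i] * x[j]);
        }
    }
}

int main(void)
{
    const int n = 1024, reps = 20;
    const int candidates[] = {4, 8, 16, 32, 64};   /* hypothetical search space */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc((size_t)n * sizeof *x);
    double *y = malloc((size_t)n * sizeof *y);
    if (!A || !x || !y) return 1;

    for (int i = 0; i < n; ++i) { x[i] = 0.5 * i; y[i] = 1.0 / (i + 1); }

    int best_bs = candidates[0];
    double best_t = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; ++c) {
        for (size_t k = 0; k < (size_t)n * n; ++k) A[k] = 0.0;  /* reset A */
        clock_t t0 = clock();
        for (int r = 0; r < reps; ++r)
            syr2_blocked(n, candidates[c], 1.0, x, y, A, n);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("bs = %2d: %.4f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_bs = candidates[c]; }
    }
    printf("best blocking factor: %d\n", best_bs);

    free(A); free(x); free(y);
    return 0;
}
```

In practice, such a timing harness would compare the tuned routine against vendor BLAS and auto-tuned libraries such as ATLAS, as the paper does for its Intel quad-core target.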
