Autotuning GEMM Kernels for the Fermi GPU

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.

[1]  Rafael Mayo,et al.  Evaluation and tuning of the Level 3 CUBLAS for graphics processors , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[2]  Nicholas J. Higham Stability of a Method for Multiplying Complex Matrices with Three Real Matrix Multiplications , 1992, SIAM J. Matrix Anal. Appl..

[3]  Marc Snir,et al.  Automatic tuning matrix multiplication performance on graphics hardware , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[4]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[5]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[6]  E. Normand Single event upset at ground level , 1996 .

[7]  Yang Yang,et al.  Automatic Library Generation for BLAS3 on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[8]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Jack J. Dongarra,et al.  The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[10]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[11]  Mary W. Hall,et al.  CHiLL : A Framework for Composing High-Level Loop Transformations , 2007 .

[12]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[13]  Jack Dongarra,et al.  Scientific Computing with Multicore and Accelerators , 2010, Chapman and Hall / CRC computational science series.

[14]  Vivek Sarkar,et al.  Software challenges in extreme scale systems , 2009 .

[15]  Robert A. van de Geijn,et al.  Anatomy of high-performance matrix multiplication , 2008, TOMS.

[16]  Bo Kågström,et al.  GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.

[17]  Naohito Nakasato,et al.  A fast GEMM implementation on the cypress GPU , 2011, PERV.

[18]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[19]  Jack J. Dongarra,et al.  Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor , 2009, Parallel Comput..

[20]  Jack J. Dongarra,et al.  A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.

[21]  Ed Anderson,et al.  LAPACK Users' Guide , 1995 .

[22]  Gang Ren,et al.  Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.

[23]  Jack J. Dongarra,et al.  From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming , 2012, Parallel Comput..

[24]  Ninghui Sun,et al.  Fast implementation of DGEMM on Fermi GPU , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[25]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[26]  David Parello,et al.  Facilitating the search for compositions of program transformations , 2005, ICS '05.

[27]  Adly T. Fam Efficient complex matrix multiplication , 1988 .

[28]  Jack J. Dongarra,et al.  An Improved Magma Gemm For Fermi Graphics Processing Units , 2010, Int. J. High Perform. Comput. Appl..

[29]  Eduardo F. D'Azevedo,et al.  Complex version of high performance computing LINPACK benchmark (HPL) , 2010, Concurr. Comput. Pract. Exp..

[30]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[31]  Chun Chen,et al.  A Programming Language Interface to Describe Transformations and Code Generation , 2010, LCPC.

[32]  Jack J. Dongarra,et al.  Accelerating GPU Kernels for Dense Linear Algebra , 2010, VECPAR.