Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

In this paper we discuss our experience in improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new Fermi features such as the cache affect performance. We find that the addition of a cache to the GPU on one hand helps the processors exploit data locality that arises at runtime, but on the other hand makes the dependence of performance on algorithmic parameters less predictable. Auto-tuning therefore becomes a useful technique to address this issue. Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050. The design and implementation use CUDA and C throughout and do not benefit from tuning at the level of binary code.
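The idea of auto-tuning over algorithmic parameters can be illustrated with a minimal sketch in CUDA C, assuming a shared-memory tiled SGEMM kernel whose tile size is the only tuned parameter. The kernel structure, matrix size, and candidate tile sizes below are illustrative assumptions, not the tuner or kernels developed in the paper.

// Minimal auto-tuning sketch: a shared-memory tiled SGEMM kernel parameterized
// by tile size, and a host loop that times each candidate tile and keeps the
// fastest. The problem size and candidate tiles are hypothetical; n is assumed
// to be a multiple of every candidate TILE.
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(int n, const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

template <int TILE>
float time_tile(int n, const float *dA, const float *dB, float *dC) {
    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    sgemm_tiled<TILE><<<grid, block>>>(n, dA, dB, dC);   // warm-up launch
    cudaEventRecord(start);
    sgemm_tiled<TILE><<<grid, block>>>(n, dA, dB, dC);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1024;                          // hypothetical matrix order
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);
    // Enumerate the (tiny) candidate space and pick the fastest configuration.
    float t8  = time_tile<8>(n, dA, dB, dC);
    float t16 = time_tile<16>(n, dA, dB, dC);
    float t32 = time_tile<32>(n, dA, dB, dC);
    int best = (t8 < t16 && t8 < t32) ? 8 : (t16 < t32 ? 16 : 32);
    printf("tile  8: %.3f ms\ntile 16: %.3f ms\ntile 32: %.3f ms\nbest tile: %d\n",
           t8, t16, t32, best);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}

A full tuner would search a larger space (thread-block shape, register blocking, shared-memory layout), which is precisely where the cache makes the performance of each configuration harder to predict analytically.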
