Improving Performance of Matrix Multiplication and FFT on GPU

In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. For the FFT, better performance is obtained over a range of dimensions. Some common principles for the design and implementation of many-core algorithms are also discussed.
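As a point of reference for the kind of kernel being tuned, the sketch below shows a minimal shared-memory tiled SGEMM kernel in CUDA. It only illustrates the basic tiling and data-reuse principle, not the authors' optimized 393-Gflops implementation; the tile size, row-major layout, launch configuration, and the assumption that all matrix dimensions are multiples of the tile size are illustrative choices.

// Minimal tiled SGEMM sketch: C = alpha*A*B + beta*C, row-major layout assumed
// (BLAS SGEMM itself is column-major). Each thread block computes one
// TILE x TILE tile of C, staging operand tiles in shared memory so that
// each global load is reused TILE times.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(int M, int N, int K,
                            float alpha, const float *A, const float *B,
                            float beta, float *C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row index in C
    int col = blockIdx.x * TILE + threadIdx.x;   // column index in C
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < K; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Example launch, assuming M, N, K are multiples of TILE:
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, M / TILE);
//   sgemm_tiled<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

A kernel of this shape is compute-bound once the tiles fit in shared memory, which is why SGEMM responds mainly to register/shared-memory blocking, whereas the FFT discussed in the paper is limited by memory bandwidth and data movement between threads.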
