Padding free bank conflict resolution for CUDA-based matrix transpose algorithm

Matrix transposition is a fundamental linear algebra operation that underlies many computational science and engineering applications. Several factors prevent large matrix transposes from reaching the expected performance on Graphics Processing Units (GPUs). The degradation is driven mainly by memory access patterns: uncoalesced accesses to global memory and bank conflicts in the shared memory of the streaming multiprocessors. In this paper, two matrix transpose algorithms are proposed that ensure both coalesced global memory access and conflict-free shared memory access. The proposed algorithms have execution times comparable to the bank-conflict-free matrix transpose implementation in the NVIDIA SDK. Their main advantage is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T × T) of the problem space, whereas, to the best of our knowledge, published approaches allocate a padded tile of T × (T + 1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% performance improvement.
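The abstract does not spell out the proposed indexing scheme, so the sketch below only illustrates the two shared-memory layouts being compared: the well-known NVIDIA SDK baseline, which pads the tile to TILE_DIM × (TILE_DIM + 1) to avoid bank conflicts, and a generic padding-free alternative that keeps the tile at exactly TILE_DIM × TILE_DIM and removes conflicts with an XOR swizzle of the column index. The swizzled kernel is an illustration of one known padding-free technique, not the paper's algorithm; TILE_DIM and BLOCK_ROWS follow the usual SDK convention of 32 and 8.

#define TILE_DIM   32
#define BLOCK_ROWS  8

// Baseline (NVIDIA SDK style): the shared tile is padded to TILE_DIM x (TILE_DIM + 1).
// The extra column shifts each row into a different bank, so the column-wise reads
// in the store phase are conflict free, at the cost of TILE_DIM extra floats per block.
__global__ void transposePadded(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced global load
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;                   // transposed block origin
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced global store
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Illustrative padding-free variant (hypothetical, not necessarily the paper's method):
// the tile is exactly TILE_DIM x TILE_DIM and the column index is XOR-swizzled with the
// row index, so both the row-wise writes and the column-wise reads of a warp are spread
// across all 32 banks without any padding.
__global__ void transposeSwizzled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];                 // no padding column

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x ^ (threadIdx.y + j)] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][(threadIdx.y + j) ^ threadIdx.x];
}

Both kernels would be launched with blockDim = dim3(TILE_DIM, BLOCK_ROWS) and one block per TILE_DIM × TILE_DIM tile of the input; the swizzle relies on TILE_DIM matching the 32-bank shared memory width.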
