RT-CUDA: A Software Tool for CUDA Code Restructuring

Recent developments in graphics processing units (GPUs) have opened new opportunities for harnessing their computing power as a general-purpose computing paradigm. However, porting applications to CUDA remains a challenge for average programmers, who have to package code in separate functions, explicitly manage data transfers between the host and device memories, and manually optimize GPU memory utilization. In this paper, we propose a restructuring tool (RT-CUDA) that takes a C-like program and some user directives as compiler hints and produces optimized CUDA code. The tool's strategy is based on efficient management of the memory system: it minimizes data motion by managing transfers between host and device, maximizes bandwidth for device memory accesses, and enhances data locality and reuse of cached data through shared memory and registers. Enhanced resource utilization is achieved by rewriting code as parametric kernels and applying efficient auto-tuning. The tool can also invoke numerical libraries (cuBLAS, cuSPARSE, etc.) to help implement scientific simulation applications such as iterative linear algebra solvers. For such applications, the tool implements an inter-block global synchronization that allows several solver iterations to execute within a single kernel launch, which helps balance load and avoid polling. RT-CUDA has been evaluated using a variety of basic linear algebra operators (Madd, MM, MV, VV, etc.) as well as iterative solvers for systems of linear equations such as the Jacobi and Conjugate Gradient algorithms. Significant speedups have been achieved over other compilers, such as the PGI OpenACC and GPGPU compilers, for these applications. The evaluation also shows that the generated kernels efficiently call math libraries and enable implementing complete iterative solvers. The tool helps scientists develop parallel simulators (reservoir simulators, molecular dynamics, etc.) without exposing them to the complexity of GPU and CUDA programming. We have a partnership with a group of researchers at Saudi Aramco, a national company in Saudi Arabia, which is currently exploring RT-CUDA as a potential development tool for applications involving linear algebra solvers. In addition, RT-CUDA is being used by senior and graduate students at King Fahd University of Petroleum and Minerals in their projects as part of its continuous enhancement.
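As an illustration of the parametric-kernel idea, the following is a minimal sketch and not RT-CUDA's actual generated code: a tiled matrix-multiply kernel whose tile size TILE is a compile-time tuning parameter, with operand tiles cached in shared memory and the accumulator held in a register. An auto-tuner would instantiate several tile sizes, time each, and keep the fastest.

    // Hypothetical sketch of a parametric kernel. TILE is a tuning
    // parameter; assumes n is a multiple of TILE.
    template <int TILE>
    __global__ void mm_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   // tile of A cached in shared memory
        __shared__ float Bs[TILE][TILE];   // tile of B cached in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;                  // accumulate in a register

        for (int t = 0; t < n; t += TILE) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
            __syncthreads();               // tiles fully staged
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               // done with this pair of tiles
        }
        C[row * n + col] = acc;
    }

    // Host-side tuning loop (sketch): launch mm_tiled<8>, mm_tiled<16>,
    // mm_tiled<32>, time each with CUDA events, and keep the best, e.g.
    //   mm_tiled<16><<<dim3(n/16, n/16), dim3(16, 16)>>>(dA, dB, dC, n);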
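Library calls can replace hand-written kernels for the solver building blocks. The sketch below is a hypothetical cg_step helper (the name and argument layout are ours, not the tool's), assuming a dense matrix and standard cuBLAS level-1/level-2 routines; it shows how one Conjugate Gradient iteration reduces to gemv, dot, axpy, and scal calls.

    #include <cublas_v2.h>

    // One Conjugate Gradient iteration via cuBLAS. dA is a dense n x n
    // matrix; dX/dR/dP/dQ are device vectors; rho holds r.r from the
    // previous iteration. Error checks and the sparse (cuSPARSE) variant
    // are omitted in this sketch.
    float cg_step(cublasHandle_t h, int n, const float *dA,
                  float *dX, float *dR, float *dP, float *dQ, float rho)
    {
        const float one = 1.0f, zero = 0.0f;

        // q = A * p
        cublasSgemv(h, CUBLAS_OP_N, n, n, &one, dA, n, dP, 1, &zero, dQ, 1);

        // alpha = rho / (p . q)
        float pq;
        cublasSdot(h, n, dP, 1, dQ, 1, &pq);
        float alpha = rho / pq, neg_alpha = -alpha;

        // x = x + alpha * p ;  r = r - alpha * q
        cublasSaxpy(h, n, &alpha, dP, 1, dX, 1);
        cublasSaxpy(h, n, &neg_alpha, dQ, 1, dR, 1);

        // rho_new = r . r ;  p = r + (rho_new / rho) * p
        float rho_new;
        cublasSdot(h, n, dR, 1, dR, 1, &rho_new);
        float beta = rho_new / rho;
        cublasSscal(h, n, &beta, dP, 1);
        cublasSaxpy(h, n, &one, dR, 1, dP, 1);

        return rho_new;   // caller loops until rho_new is small enough
    }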
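The inter-block synchronization can be realized with a software barrier in global memory, in the spirit of fast-barrier schemes from the literature. The global_barrier helper below is a minimal sketch of one such scheme, not necessarily RT-CUDA's implementation; it is only safe when every block of the grid is simultaneously resident on the GPU, a condition a restructuring tool must enforce when sizing the grid. The epoch argument makes the barrier reusable across iterations without resetting the counter.

    // Software inter-block barrier via an atomic counter in global memory.
    __device__ volatile unsigned int g_arrived = 0;

    // Call with epoch = 1, 2, 3, ... on successive crossings; all blocks
    // have passed epoch e once the counter reaches e * num_blocks.
    // Assumes one-dimensional thread blocks.
    __device__ void global_barrier(unsigned int num_blocks, unsigned int epoch)
    {
        __syncthreads();                         // quiesce this block
        if (threadIdx.x == 0) {
            __threadfence();                     // publish this block's writes
            atomicAdd((unsigned int *)&g_arrived, 1u);
            while (g_arrived < epoch * num_blocks)
                ;                                // spin until all blocks arrive
        }
        __syncthreads();                         // release the block's threads
    }

    // Usage inside a fused solver kernel: several iterations per launch,
    // avoiding one kernel launch (and implicit global sync) per iteration.
    __global__ void jacobi_fused(/* ... solver arguments ... */ int iters)
    {
        for (int it = 1; it <= iters; ++it) {
            // ... one Jacobi sweep over this block's share of the rows ...
            global_barrier(gridDim.x, it);       // wait for all blocks
        }
    }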
