Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs

Given the extraordinary computational power of modern graphics processing units (GPUs), general purpose computation on GPUs (GPGPU) has become an increasingly important platform for high performance computing. To better understand how well the GPU resource has been utilized by application developers and then to facilitate them to develop high performance GPGPU code, we conduct an empirical study on GPGPU programs from ten open-source projects. These projects span a wide range of disciplines and many are designed as high performance libraries. Among these projects, we found various performance 'bugs', i.e., code segments leading to inefficient use of GPU hardware. We characterize these performance bugs, and propose the bug fixes. Our experiments confirm both significant performance gains and energy savings from our fixes and reveal interesting insights on different GPUs.

[1]  Ramani Duraiswami,et al.  Middleware for programming NVIDIA GPUs from Fortran 9X , 2007 .

[2]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.

[3]  Sean Rul,et al.  An experimental study on performance portability of OpenCL kernels , 2010, HiPC 2010.

[4]  Jack Dongarra,et al.  An Improved MAGMA GEMM for Fermi GPUs , 2010 .

[5]  Xipeng Shen,et al.  A cross-input adaptive framework for GPU program optimizations , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[6]  Noel Lopes,et al.  GPUMLib: A new Library to combine Machine Learning algorithms with Graphics Processing Units , 2010, 2010 10th International Conference on Hybrid Intelligent Systems.

[7]  Alice C. Quillen,et al.  QYMSYM: A GPU-accelerated hybrid symplectic integrator that permits close encounters , 2010, 1007.3458.

[8]  David R. Kaeli,et al.  Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures , 2011, IEEE Transactions on Parallel and Distributed Systems.

[9]  Thomas Fahringer,et al.  Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design , 2011, Euro-Par.

[10]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[11]  Wen-mei W. Hwu,et al.  Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[12]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  Noriyuki Fujimoto Dense Matrix-Vector Multiplication on the CUDA Architecture , 2008, Parallel Process. Lett..

[14]  Kai Lu,et al.  Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing , 2010, 2010 IEEE International Conference on Cluster Computing.

[15]  Wolfgang Paul,et al.  GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model , 2009, J. Comput. Phys..

[16]  Amitabh Varshney,et al.  High-throughput sequence alignment using Graphics Processing Units , 2007, BMC Bioinformatics.

[17]  Wen-mei W. Hwu,et al.  Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications , 2010, International Journal of Parallel Programming.