On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

General-purpose graphics processing units (GPUs) have proven to be viable platforms for large-scale numerical computations with inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To keep the GPU from being under-utilized for small problem sizes, a sensible approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, Concurrent Kernel Execution (CKE) allows up to 16 GPU kernels to execute concurrently on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can be challenging to manage multiple host threads interacting with the GPU device while also keeping CKE working properly. CKE performance is observed to break down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since in real-world applications multiple host threads commonly process their data independently, a mechanism is needed that helps avoid this CKE breakdown. We propose a producer-consumer approach that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.
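
The producer-consumer idea can be illustrated with a minimal host-side sketch in C++. It is not the authors' implementation: the names `LaunchRequest`, `LaunchQueue`, `reorder`, and `run_demo` are hypothetical, no CUDA runtime calls are made, and the interleaving rule (issue the k-th kernel of every host thread before any thread's (k+1)-th kernel) is only one plausible reordering policy, assumed here for illustration. Host threads act as producers that submit launch requests to a shared queue; a single consumer collects them, reorders them, and would then issue the actual kernel launches.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical launch request: records which host thread submitted it and
// its position in that thread's own sequence of kernel invocations.
struct LaunchRequest {
    int producer;  // id of the submitting host thread (assumed: one stream per thread)
    int sequence;  // index within that thread's launch sequence
};

// Shared queue: producers submit requests, the consumer takes them in batches.
class LaunchQueue {
public:
    void submit(LaunchRequest r) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(r);
        ready_.notify_one();
    }

    // Block until at least n requests are pending, then remove and return n.
    std::vector<LaunchRequest> take(std::size_t n) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [&] { return pending_.size() >= n; });
        auto last = pending_.begin() + static_cast<std::ptrdiff_t>(n);
        std::vector<LaunchRequest> batch(pending_.begin(), last);
        pending_.erase(pending_.begin(), last);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::vector<LaunchRequest> pending_;
};

// Assumed reordering policy: interleave the producers so that kernels from
// different host threads sit next to each other in the issue order.
std::vector<LaunchRequest> reorder(std::vector<LaunchRequest> batch) {
    std::sort(batch.begin(), batch.end(),
              [](const LaunchRequest& a, const LaunchRequest& b) {
                  if (a.sequence != b.sequence) return a.sequence < b.sequence;
                  return a.producer < b.producer;
              });
    return batch;
}

// Run `producers` host threads, each submitting `kernels_per_producer`
// requests, and return the order in which the consumer would issue them.
std::vector<LaunchRequest> run_demo(int producers, int kernels_per_producer) {
    LaunchQueue queue;
    std::vector<LaunchRequest> issued;

    std::thread consumer([&] {
        auto total = static_cast<std::size_t>(producers * kernels_per_producer);
        issued = reorder(queue.take(total));
        // A real program would launch each kernel here, in this order,
        // into the stream belonging to its producer.
    });

    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&queue, p, kernels_per_producer] {
            for (int k = 0; k < kernels_per_producer; ++k) {
                queue.submit({p, k});
            }
        });
    }

    for (auto& t : threads) t.join();
    consumer.join();
    return issued;
}
```

Calling `run_demo(2, 3)` yields an issue order that alternates between the two host threads regardless of the order in which their requests arrived. The sketch waits for the complete batch before reordering, which keeps the example deterministic; a practical consumer would more likely reorder whatever is pending at regular intervals.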
