Performance hotspot based CUDA acceleration

With the introduction of many-core GPUs, there is widespread interest in using GPUs to accelerate non-graphics applications in bioinformatics, energy, finance and several research areas. Although GPUs provide highly parallel processing capability, a performance improvement is not always achievable, for several reasons. One of them is the way the application is mapped to the hardware acceleration module. In this paper, we identify the performance hotspot functions of an application and use them to guide the mapping of the application for CUDA acceleration. In our experiments with three non-graphics applications, hotspot-function-based CUDA acceleration shows a 15% to 40% performance improvement on GPGPU with minimal effort.
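The idea of picking offload candidates from a profile can be illustrated with a minimal sketch. This is not the authors' tooling: it uses Python's standard `cProfile` as a stand-in profiler, and the functions `load_input`, `smooth` and `application` are hypothetical placeholders for a real workload. The helper `amdahl_speedup` applies Amdahl's law to show why accelerating only the hotspot bounds the overall gain.

```python
import cProfile
import pstats


def load_input(n):
    # Hypothetical cheap setup step.
    return [float(i % 17) for i in range(n)]


def smooth(data, passes):
    # Hypothetical compute-heavy routine: repeated 3-point circular averaging.
    n = len(data)
    for _ in range(passes):
        out = [0.0] * n
        for i in range(n):
            out[i] = (data[i - 1] + data[i] + data[(i + 1) % n]) / 3.0
        data = out
    return data


def application():
    # Hypothetical application: cheap setup followed by a heavy loop.
    return sum(smooth(load_input(1000), 100))


def find_hotspots(func, top=3):
    """Profile func; return (function name, share of own time), largest first."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    # stats maps (file, line, name) -> (calls, prim_calls, own_time, cum_time, callers)
    stats = pstats.Stats(profiler).stats
    total = sum(entry[2] for entry in stats.values()) or 1.0
    ranked = sorted(((key[2], entry[2] / total) for key, entry in stats.items()),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


def amdahl_speedup(fraction, kernel_speedup):
    """Whole-program speedup bound when only `fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


if __name__ == "__main__":
    for name, share in find_hotspots(application):
        print(f"{name}: {share:.1%} of profiled time")
```

In this sketch the function with the largest share of own time would be the first candidate to port to a CUDA kernel. The measured shares depend on the machine and the profiler overhead, so they are indicative only.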
