Patterns of Inefficient Performance Behavior in GPU Applications

Writing efficient software for heterogeneous architectures equipped with modern accelerator devices presents a serious challenge to programmer productivity, creating a need for powerful performance-analysis tools that adequately support the software development process. To guide the design of such tools, we describe typical patterns of inefficient runtime behavior that may adversely affect the performance of applications that use general-purpose processors together with GPU devices through a CUDA compute engine. To evaluate the general impact of these patterns on application performance, we further present a microbenchmark suite that allows the performance penalty of each pattern to be quantified. Results obtained on NVIDIA Fermi and Tesla architectures demonstrate significant delays. Furthermore, this suite can be used as a default test scenario when adding CUDA support to performance-analysis tools used in high-performance computing.
