Performance Debugging of GPGPU Applications with the Divergence Map

The increasing programability and the high computational power of Graphical Processing Units (GPU) make them attractive to general purpose programming. However, taking full bene t of this execution environment is a challenging task. One of these challenges stem from divergences, a phenomenon that occurs when threads that execute in lock-step are forced to take di erent program paths due to branches in the code. In face of divergences, some threads will have to wait, idly, while their diverging siblings execute. Optimizing the code to avoid divergences is diffcult, because this task demands a deep understanding of programs that might be large and convoluted. In order to facilitate the detection of divergences, this paper introduces the divergence map, a data structure that indicates the location and the volume of divergences in a program. We build this map via dynamic profiling techniques, which we have implemented on top of an open source CUDA compiler. To illustrate the importance of the divergence map, we have used it to pin-point the core regions that must be optimized in well known public applications. By hand optimizing some applications, we have added 9-11% speedups onto kernels that have already gone through the sieve of many programmers.

[1]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.

[2]  Stijn Eyerman,et al.  Per-thread cycle accounting in SMT processors , 2009, ASPLOS.

[3]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[4]  Barton P. Miller,et al.  The Paradyn Parallel Performance Measurement Tool , 1995, Computer.

[5]  Edward T. Grochowski,et al.  Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[6]  Matthias Hauswirth,et al.  Evaluating the accuracy of Java profilers , 2010, PLDI '10.

[7]  Kai Li,et al.  Performance measurements for multithreaded programs , 1998, SIGMETRICS '98/PERFORMANCE '98.

[8]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[9]  Kun Zhou,et al.  BSGP: bulk-synchronous GPU programming , 2008, SIGGRAPH 2008.

[10]  Nathan R. Tallent,et al.  Effective performance measurement and analysis of multithreaded applications , 2009, PPoPP '09.

[11]  Kenneth E. Batcher,et al.  Sorting networks and their applications , 1968, AFIPS Spring Joint Computing Conference.

[12]  Wen-mei W. Hwu,et al.  Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[13]  Mark J. Harris,et al.  Parallel Prefix Sum (Scan) with CUDA , 2011 .

[14]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[15]  Susan L. Graham,et al.  gprof: a call graph execution profiler (with retrospective) , 1982 .

[16]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[17]  Avi Mendelson,et al.  Programming model for a heterogeneous x86 platform , 2009, PLDI '09.

[18]  Philippas Tsigas,et al.  GPU-Quicksort: A practical Quicksort algorithm for graphics processors , 2010, JEAL.

[19]  Oscar Naim,et al.  Dynamic instrumentation of threaded applications , 1999, PPoPP '99.

[20]  Michael Boyer Automated Dynamic Analysis of CUDA Programs , 2008 .

[21]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.