Identifying Optimization Opportunities Within Kernel Execution in GPU Codes

Tuning codes for GPGPU architectures is challenging because few performance tools can pinpoint the exact causes of execution bottlenecks. While profiling applications can reveal execution behavior on a particular architecture, the abundance of collected information can also overwhelm the user. Moreover, performance counters provide cumulative values but do not attribute events to code regions, which makes identifying performance hot spots difficult. This research focuses on characterizing the behavior of GPU application kernels and their performance at the node level by providing a visualization and metrics display that indicates the behavior of the application with respect to the underlying architecture. By sampling instruction mixes during kernel execution, we reveal a variety of intrinsic program characteristics relating to computation, memory, and control flow. We demonstrate the effectiveness of our techniques with LAMMPS and LULESH application case studies on a variety of GPU architectures.
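To make the notion of an instruction mix concrete, the following is a minimal sketch, not the tooling described in this work. It assumes a kernel's sampled instructions are available as a list of opcode strings and that each opcode can be assigned to a computation, memory, or control-flow category. The opcode table, the function name `instruction_mix`, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical opcode-to-category table (an assumption for illustration only;
# a real tool would derive this from the target GPU's instruction set).
CATEGORY_BY_OPCODE = {
    "FADD": "compute", "FMUL": "compute", "FFMA": "compute", "IADD": "compute",
    "LDG": "memory", "STG": "memory", "LDS": "memory", "STS": "memory",
    "BRA": "control", "EXIT": "control",
}

def instruction_mix(sampled_opcodes):
    """Return the fraction of sampled instructions in each category."""
    counts = Counter(
        CATEGORY_BY_OPCODE.get(op, "other") for op in sampled_opcodes
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no samples collected for this kernel
    return {category: n / total for category, n in counts.items()}

# Made-up samples for one kernel launch.
samples = ["FFMA", "FFMA", "LDG", "BRA", "FADD", "STG", "FFMA", "LDG"]
mix = instruction_mix(samples)
for category, fraction in sorted(mix.items()):
    print(f"{category}: {fraction:.3f}")
```

A kernel whose mix is dominated by memory instructions would be a candidate for memory-access tuning, while a high control-flow share might point to branch divergence; the categories and any thresholds are left to the analyst.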
