Paper Information - Utilizing GPU Performance Counters to Characterize GPU Kernels via Machine Learning

GPU computing kernels are relatively simple to write when achieving the best performance is not the highest priority. However, the task quickly becomes much more daunting when users try to tune and optimize their kernels to obtain the highest performance. This is due to GPUs’ massive degree of parallelism, complex memory hierarchy, fine-grained synchronization, and long memory access latency. Hence, users must carry out the complex tasks of profiling, analyzing, and tuning to reduce performance bottlenecks. Today’s GPUs can generate hundreds of performance events that comprehensively quantify the behavior of a kernel. Instead of relying on experts’ manual analysis, this paper uses machine learning methods to generalize from GPU performance counter data and determine the characteristics of a GPU kernel, since those characteristics reveal possible reasons for low performance. We choose a set of problem-independent counters as inputs, then design and compare three machine learning methods that automatically classify the execution behavior of a kernel. The experimental results on stencil computing kernels and sparse matrix multiplications show that the machine learning models achieve good accuracy, and demonstrate a feasible approach for classifying a kernel’s characteristics and suggesting changes to a skilled user, who can subsequently improve kernel performance with less guesswork.
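The abstract does not name the three machine learning methods or the counters used, so the following is only a minimal sketch of the general idea: a kernel run is reduced to a vector of normalized counter values, and a classifier maps that vector to a behavior class. The nearest-centroid classifier, the feature names, the class labels, and all numeric values here are assumptions chosen for illustration; they are not the paper's method or data.

```python
import math

# Made-up, illustrative counter readings (NOT data from the paper).
# Each tuple is one kernel run, with ratios in [0, 1] so that the
# features do not depend on problem size.
# Assumed feature order: (occupancy, dram_utilization, branch_divergence)
TRAINING = {
    "memory_bound": [(0.80, 0.95, 0.05), (0.75, 0.90, 0.10)],
    "low_occupancy": [(0.20, 0.30, 0.05), (0.25, 0.35, 0.10)],
    "divergent": [(0.70, 0.30, 0.80), (0.65, 0.25, 0.90)],
}


def centroid(rows):
    """Return the per-feature mean of a list of equal-length tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def fit(training):
    """Build a nearest-centroid model: one mean vector per class label."""
    return {label: centroid(rows) for label, rows in training.items()}


def classify(model, sample):
    """Return the label whose centroid is closest (Euclidean) to sample."""
    return min(model, key=lambda label: math.dist(model[label], sample))


model = fit(TRAINING)
print(classify(model, (0.78, 0.92, 0.07)))  # prints "memory_bound"
```

In practice the counter vector would come from a profiler run rather than hand-written tuples, and the predicted class would serve as a hint about where to look, which matches the abstract's goal of suggesting changes to a skilled user.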

Fengguang Song | Bob Zigon
