Autotuning GPU Kernels via Static and Predictive Analysis

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code but may miss tuning opportunities because they do not target specific GPU models or apply the necessary code transformations. Although empirical autotuning addresses some of these challenges, it requires extensive experimentation and search to find optimal code variants. This research presents an approach for tuning CUDA kernels based on static analysis that considers fine-grained code structure and specific GPU architectural features. Notably, our approach requires no program runs to discover near-optimal parameter settings. We demonstrate its applicability by enabling code autotuners such as Orio to produce code variants competitive with those found by empirical methods, without the high cost of experiments.
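To make the tuning problem concrete, the sketch below shows the kind of thread/block launch parameters the abstract refers to. The SAXPY kernel, variable names, and candidate block sizes are illustrative assumptions, not code from the paper; an empirical autotuner would benchmark the candidates, whereas a static/predictive approach would select a value without running the kernel.

#include <cuda_runtime.h>

// Hypothetical SAXPY kernel, used only to illustrate the launch parameters
// (threads per block, blocks per grid) that a programmer or autotuner must choose.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    // threadsPerBlock is the tunable launch parameter; typical candidates are
    // 32, 64, ..., 1024. Its best value depends on the kernel's structure and
    // the target GPU's architectural features (illustrative choice below).
    int threadsPerBlock = 128;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}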
