GPU code optimization using abstract kernel emulation and sensitivity analysis

In this paper, we develop an approach to GPU kernel optimization that focuses on identifying bottleneck resources and determining the optimization parameters that can alleviate those bottlenecks. GPU performance is modeled by abstract kernel emulation combined with latency/gap modeling of hardware resources. Sensitivity analysis with respect to the resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: (1) coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner, where experimental results on all kernels from the Rodinia benchmark suite and on GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate its effectiveness; and (2) manual code optimization, where two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code produced by state-of-the-art domain-specific code generators.
