Automatic optimization of thread-coarsening for graphics processors

OpenCL is designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits the performance portability of OpenCL programs: programmers must manually tune applications for each specific device, which defeats effective portability. We target thread-coarsening, a compiler transformation specific to data-parallel languages, and show that it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening-factor parameter, i.e., deciding how many threads to merge together. We show experimentally that this is a hard problem: good configurations are difficult to find, and naive coarsening can in fact lead to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor from static features of the kernel function. The model automatically specializes to each of the architectures considered. We evaluate our approach on 17 benchmarks across four devices: two Nvidia GPUs and two generations of AMD GPUs. Using our technique, we achieve average speedups between 1.11× and 1.33×.
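To make the transformation concrete, the sketch below shows a minimal OpenCL vector-addition kernel coarsened by a factor of 2. The kernel names and the consecutive-element merging strategy are illustrative assumptions, not the paper's compiler output; an actual coarsening pass may choose a different stride or interleaving.

    // Original kernel: one work-item computes one element.
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c) {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    // Coarsened by a factor of 2: each work-item computes two
    // consecutive elements, so the kernel is launched with half
    // as many work-items.
    __kernel void vadd_coarsened(__global const float *a,
                                 __global const float *b,
                                 __global float *c) {
        int i = get_global_id(0) * 2;
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }

Merging threads amortizes per-thread overhead and can increase instruction-level parallelism, but it also reduces the number of in-flight threads available to hide memory latency, which is why the best coarsening factor depends on both the kernel and the target device.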
