Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability

Heterogeneous computing, which combines devices with different architectures, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good performance on another device. In this paper, we use machine learning-based auto-tuning to address this problem. Benchmarks are run on a random subset of the entire tuning parameter configuration space, and the results are used to build an artificial neural network-based model. The model can then be used to find interesting parts of the parameter space for further search. We evaluate our method with different benchmarks on several devices, including an Intel i7 3770 CPU, an Nvidia K40 GPU, and an AMD Radeon HD 7970 GPU. Our model achieves a mean relative error as low as 6.1%, and is able to find configurations as little as 1.3% worse than the global minimum.
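The workflow described in the abstract can be illustrated with a minimal sketch: sample a random subset of configurations, benchmark them, fit a neural-network model of runtime, and use its predictions to select promising configurations for a follow-up empirical search. This is not the authors' implementation; the parameter ranges, the `run_benchmark` stand-in, and the use of scikit-learn's `MLPRegressor` are assumptions made purely for illustration.

```python
# Minimal sketch of ML-based auto-tuning, assuming a hypothetical OpenCL
# kernel with three tuning parameters and scikit-learn's MLPRegressor as
# the neural-network model.
import itertools
import random
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical tuning parameters (e.g. work-group size, unroll factor,
# whether to use local memory).
param_space = list(itertools.product([32, 64, 128, 256],   # work-group size
                                     [1, 2, 4, 8],          # unroll factor
                                     [0, 1]))               # use local memory?

def run_benchmark(config):
    # Stand-in for compiling and timing the OpenCL kernel with this
    # configuration; returns a dummy runtime so the sketch is runnable.
    wg, unroll, local_mem = config
    return random.random() + 1.0 / wg

# 1. Benchmark a random subset of the configuration space.
sample = random.sample(param_space, k=len(param_space) // 4)
X_train = np.array(sample, dtype=float)
y_train = np.array([run_benchmark(c) for c in sample])

# 2. Fit a neural-network model of runtime as a function of the parameters.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
model.fit(X_train, y_train)

# 3. Rank all configurations by predicted runtime and keep the most
#    promising ones for further (empirical) search.
X_all = np.array(param_space, dtype=float)
predicted = model.predict(X_all)
candidates = [param_space[i] for i in np.argsort(predicted)[:10]]
```

In practice the candidate configurations returned by the model would be benchmarked on the actual device, which is how the paper narrows the search toward configurations close to the global minimum.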
