A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

Autotuning of performance-relevant source-code parameters allows applications to be tuned automatically without hard-coding optimizations and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead. Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
