Using hardware performance counters to speed up autotuning convergence on GPUs

GPU accelerators are now commonly used to speed up general-purpose computing tasks on a variety of hardware. However, because GPU architectures and processed data are so diverse, optimizing code for a particular type of hardware and specific data characteristics can be extremely challenging. Autotuning of performance-relevant source-code parameters allows applications to be optimized automatically and keeps their performance portable. Although autotuning typically speeds up the code, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in the processed data or migration to different hardware. In this paper, we introduce a novel method for searching generic tuning spaces. The tuning spaces can contain tuning parameters that change any user-defined property of the source code. The method collects hardware performance counters (also known as profiling counters) during empirical tuning and uses them to steer the search towards faster implementations. The method requires the tuning space to be sampled on any GPU. It then builds a problem-specific model, which can be used during autotuning on various, even previously unseen, inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that it is superior in terms of the number of search steps and typically outperforms other search methods in terms of convergence time.
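The idea of steering a search with performance counters can be pictured with the toy sketch below. It is not the method of the paper: the tuning parameters (`tile`, `unroll`), the counter names, the closed-form `predicted_counters` model, the simulated `measure` step, and the weighting rule are all assumptions made up for illustration. The sketch only shows the general pattern: measure a configuration, find the counter that looks like the bottleneck, and prefer untested configurations that a model predicts will reduce it.

```python
import random

# Hypothetical tuning space: every combination of two made-up parameters.
SPACE = [(tile, unroll) for tile in (1, 2, 4, 8) for unroll in (1, 2, 4)]


def predicted_counters(cfg):
    """Stand-in for a problem-specific model (assumed, not from the paper)."""
    tile, unroll = cfg
    return {
        "mem_transactions": 64.0 / tile,
        "instructions": 8.0 * tile / unroll + 4.0 * unroll,
    }


def measure(cfg):
    """Stand-in for one empirical tuning step: returns (runtime, counters).

    A real tuner would compile and run the kernel and read the counters
    from a profiler; here the runtime is simply the largest counter value.
    """
    counters = predicted_counters(cfg)
    return max(counters.values()), counters


def guided_search(steps, seed=0):
    """Counter-guided search over SPACE, limited to `steps` measurements."""
    rng = random.Random(seed)
    untested = list(SPACE)
    current = rng.choice(untested)
    untested.remove(current)
    best_cfg = current
    best_time, counters = measure(current)
    for _ in range(steps - 1):
        if not untested:
            break
        # Treat the largest counter of the best configuration as the bottleneck.
        bottleneck = max(counters, key=counters.get)
        # Favor configurations predicted to lower the bottleneck counter;
        # the small constant keeps every configuration reachable.
        weights = [
            max(counters[bottleneck] - predicted_counters(c)[bottleneck], 0.0) + 0.01
            for c in untested
        ]
        current = rng.choices(untested, weights=weights, k=1)[0]
        untested.remove(current)
        runtime, new_counters = measure(current)
        if runtime < best_time:
            best_cfg, best_time, counters = current, runtime, new_counters
    return best_cfg, best_time


if __name__ == "__main__":
    print(guided_search(steps=5))
```

In this toy setting the model and the measurement coincide, so the guidance is unrealistically accurate. In practice the model would come from sampling the tuning space on one GPU, and the measurements would come from a different input or device.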
