Improving Auto-Tuning Convergence Times with Dynamically Generated Predictive Performance Models

Automatic performance tuning is becoming an increasingly valuable tool for improving performance portability when targeting a diverse range of processor architectures. Much of the existing work on auto-tuning techniques focuses solely on achieving the best possible performance, with little attention paid to the time required to perform the tuning process itself. As developers face progressively larger sets of target platforms, the tuning time required to achieve performance goals on each platform will be a crucial factor in determining the success of different auto-tuning techniques. In this work, we describe a hybrid approach to auto-tuning that combines empirical sampling with a predictive performance model, with the goal of reducing the time needed to converge on the optimal (or near-optimal) configuration. We show that our approach provides a three-fold reduction in the tuning time required to achieve performance within 10% of the global optimum.
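
The loop below is a minimal sketch of the hybrid idea described above: seed the search with a small empirical sample, fit a predictive model over the measured configurations, then spend the remaining measurement budget on the configurations the model predicts to be fastest. The surrogate choice (a scikit-learn regression tree), the `measure()` benchmark stub, the tuning parameters, and the sampling schedule are all illustrative assumptions, not the paper's exact method.

```python
# Hybrid auto-tuning sketch: alternate empirical sampling with a predictive
# model that ranks untested configurations. Assumes scikit-learn is available;
# measure() stands in for a real compile-and-time benchmark run.
import itertools
import random
from sklearn.tree import DecisionTreeRegressor

def measure(cfg):
    # Placeholder empirical benchmark: a synthetic runtime so the sketch runs
    # stand-alone. A real tuner would build and execute the kernel here.
    block, unroll = cfg
    return abs(block - 64) * 0.01 + abs(unroll - 4) * 0.05 + random.random() * 0.01

# Hypothetical tuning space: block size x unroll factor.
space = list(itertools.product([16, 32, 64, 128, 256], [1, 2, 4, 8]))

random.seed(0)
tested = {}

# 1) Seed the model with a small random empirical sample.
for cfg in random.sample(space, 5):
    tested[cfg] = measure(cfg)

# 2) Alternate: refit the surrogate, then empirically verify the
#    configurations it predicts to perform best.
for _ in range(3):
    model = DecisionTreeRegressor().fit(list(tested), list(tested.values()))
    untested = [c for c in space if c not in tested]
    ranked = sorted(untested, key=lambda c: model.predict([c])[0])
    for cfg in ranked[:3]:
        tested[cfg] = measure(cfg)

best = min(tested, key=tested.get)
print("best configuration:", best, "runtime:", tested[best])
```

In this sketch the model only guides which points get measured; every reported runtime still comes from an empirical run, which is what allows the search to converge near the true optimum with far fewer measurements than exhaustive sampling.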
