Autotuning in High-Performance Computing Applications

Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors’ extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve.

[1]  Richard D. Hornung,et al.  The RAJA Portability Layer: Overview and Status , 2014 .

[2]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[3]  Matteo Frigo A Fast Fourier Transform Compiler , 1999, PLDI.

[4]  Victor Eijkhout,et al.  Proof-Driven Derivation of Krylov Solver Libraries , 2010 .

[5]  Jimeng Sun,et al.  Model-Driven Sparse CP Decomposition for Higher-Order Tensors , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[6]  Chun Chen,et al.  Loop Transformation Recipes for Code Generation and Auto-Tuning , 2009, LCPC.

[7]  P. Sadayappan,et al.  Stencil-Aware GPU Optimization of Iterative Solvers , 2013, SIAM J. Sci. Comput..

[8]  Michael F. P. O'Boyle,et al.  Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[9]  Chun Chen,et al.  Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology , 2010, Software Automatic Tuning, From Concepts to State-of-the-Art Results.

[10]  Ananta Tiwari,et al.  Online Adaptive Code Generation and Tuning , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[11]  Samuel Williams,et al.  Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[12]  Allen D. Malony,et al.  Design and implementation of a parallel performance data management framework , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[13]  William J. Dally,et al.  A tuning framework for software-managed memory hierarchies , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[14]  Michael Garland,et al.  Nitro: A Framework for Adaptive Code Variant Tuning , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[15]  Samuel Williams,et al.  Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..

[16]  James Demmel,et al.  Author retrospective for optimizing matrix multiply using PHiPAC: a portable high-performance ANSI C coding methodology , 2014, ICS 25th Anniversary.

[17]  Jack J. Dongarra,et al.  A proposal for a set of level 3 basic linear algebra subprograms , 1987, SGNM.

[18]  David E. Bernholdt,et al.  Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models , 2005, Proceedings of the IEEE.

[19]  David D. Cox,et al.  Machine learning for predictive auto-tuning with boosted regression trees , 2012, 2012 Innovative Parallel Computing (InPar).

[20]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[21]  Katherine A. Yelick,et al.  Optimizing Sparse Matrix Vector Multiplication on SMP , 1999, SIAM Conference on Parallel Processing for Scientific Computing.

[22]  Daniel Sunderland,et al.  Kokkos, a Manycore Device Performance Portability Library for C++ HPC Applications , 2014 .

[23]  Prasanna Balaprakash,et al.  Can search algorithms save large-scale automatic performance tuning? , 2011, ICCS.

[24]  Prasanna Balaprakash,et al.  Exploiting Performance Portability in Search Algorithms for Autotuning , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[25]  Martin Schulz,et al.  Caliper: Performance Introspection for HPC Software Stacks , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[26]  Elizabeth R. Jessup,et al.  Lighthouse: an automated solver selection tool , 2015, SE-HPCCSE@SC.

[27]  Chun Chen,et al.  Model-guided empirical optimization for memory hierarchy , 2007 .

[28]  P. Sadayappan,et al.  Annotation-based empirical performance tuning using Orio , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[29]  Hiroaki Kobayashi,et al.  Xevolver: An XML-based code translation framework for supporting HPC application migration , 2014, 2014 21st International Conference on High Performance Computing (HiPC).

[30]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[31]  Vahid Tabatabaee,et al.  Parallel Parameter Tuning for Applications with Performance Variability , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[32]  James Demmel,et al.  Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[33]  Prasanna Balaprakash,et al.  An Experimental Study of Global and Local Search Algorithms in Empirical Performance Tuning , 2012, VECPAR.

[34]  Richard W. Vuduc,et al.  A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[35]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[36]  Prasanna Balaprakash,et al.  Generating Efficient Tensor Contractions for GPUs , 2015, 2015 44th International Conference on Parallel Processing.

[37]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[38]  Boyana Norris,et al.  Autotuning Stencil-Based Computations on GPUs , 2012, 2012 IEEE International Conference on Cluster Computing.

[39]  I-Hsin Chung,et al.  A Case Study Using Automatic Performance Tuning for Large-Scale Scientific Programs , 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.

[40]  Jack J. Dongarra,et al.  A comparison of search heuristics for empirical code optimization , 2008, 2008 IEEE International Conference on Cluster Computing.

[41]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[42]  Elizabeth R. Jessup,et al.  Performance-Based Numerical Solver Selection in the Lighthouse Framework , 2016, SIAM J. Sci. Comput..

[43]  Alexander Heinecke,et al.  LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[44]  Michael Voss,et al.  High-level adaptive program optimization with ADAPT , 2001, PPoPP '01.

[45]  Ignacio Laguna,et al.  Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[46]  David A. Padua,et al.  A Language for the Compact Representation of Multiple Program Versions , 2005, LCPC.

[47]  Paolo Bientinesi,et al.  Application-tailored linear algebra algorithms , 2012, Int. J. High Perform. Comput. Appl..

[48]  Kalyan Veeramachaneni,et al.  Autotuning algorithmic choice for input sensitivity , 2015, PLDI.

[49]  Richard W. Vuduc,et al.  POET: Parameterized Optimizations for Empirical Tuning , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[50]  Bronis R. de Supinski,et al.  The Spack package manager: bringing order to HPC software chaos , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[51]  William Gropp,et al.  Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries , 1997, SciTools.

[52]  Ken Kennedy,et al.  Automatic tuning of whole applications using direct search and a performance-based transformation system , 2006, The Journal of Supercomputing.

[53]  Elizabeth R. Jessup,et al.  Lighthouse: a taxonomy-based solver selection tool , 2015, SEPS@SPLASH.

[54]  E. Im,et al.  Optimizing Sparse Matrix Vector Multiplication on SMP , 1999, PPSC.

[55]  Jeffrey K. Hollingsworth,et al.  Computation-communication overlap and parameter auto-tuning for scalable parallel 3-D FFT , 2016, J. Comput. Sci..

[56]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2009, Parallel Comput..

[57]  Tamara G. Kolda,et al.  An overview of the Trilinos project , 2005, TOMS.

[58]  Chun Chen,et al.  Auto-tuning full applications: A case study , 2011, Int. J. High Perform. Comput. Appl..

[59]  Chun Chen,et al.  Speeding up Nek5000 with autotuning and specialization , 2010, ICS '10.

[60]  Barton P. Miller,et al.  Dynamic program instrumentation for scalable performance tools , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[61]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[62]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[63]  Samuel Williams,et al.  Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms , 2009, J. Parallel Distributed Comput..

[64]  William Gropp,et al.  Annotations for Productivity and Performance Portability , 2007 .

[65]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[66]  Khalid Ahmad,et al.  Optimizing LOBPCG: Sparse Matrix Loop and Data Transformations in Action , 2016, LCPC.

[67]  Chun Chen,et al.  A scalable auto-tuning framework for compiler optimization , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[68]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[69]  Boyana Norris,et al.  Generating Customized Sparse Eigenvalue Solutions with Lighthouse , 2014 .

[70]  Jack J. Dongarra,et al.  Autotuning GEMM Kernels for the Fermi GPU , 2012, IEEE Transactions on Parallel and Distributed Systems.

[71]  Kunle Olukotun,et al.  A Heterogeneous Parallel Framework for Domain-Specific Languages , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.