Towards making autotuning mainstream

Autotuning systems employ empirical techniques to evaluate the suitability of candidate implementations drawn from a search space of possible implementations of a computation. Autotuning has emerged as a critical strategy for achieving high performance as architectural complexity grows. Present-day autotuning technology augments the capabilities of expert users or is hidden inside compilers, but to date it has not been adopted as a mainstream technology. Based on our prior experience and the experience of others in developing autotuning technology and applying it to libraries and applications, this paper examines some of the barriers to adoption of the technology and identifies future research areas that could break down these barriers.
