Exploiting Performance Portability in Search Algorithms for Autotuning

Autotuning seeks the best configuration of an application by orchestrating the hardware and software knobs that affect performance on a given machine. Autotuners adopt various search techniques to find the best configuration efficiently, but they often ignore lessons learned on one machine when tuning for another. We demonstrate that a surrogate model built from performance results on one machine can speed up the autotuning search by 1.6X to 130X on a variety of modern architectures.
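
The idea of reusing one machine's results to guide the search on another can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two runtime functions are synthetic stand-ins for real measurements, the knobs (tile size and unroll factor) and the names `runtime_source`, `runtime_target`, and `fit_knn_surrogate` are hypothetical, and the k-nearest-neighbor averaging is a deliberately simple surrogate rather than the model used in this work.

```python
from itertools import product

# Hypothetical runtime models standing in for real measurements (synthetic data).
def runtime_source(tile, unroll):
    return (tile - 32) ** 2 / 64.0 + (unroll - 4) ** 2 + 5.0

def runtime_target(tile, unroll):
    # Same general shape as the source machine, with a shifted optimum and a different scale.
    return 1.5 * ((tile - 48) ** 2 / 64.0 + (unroll - 4) ** 2) + 8.0

def fit_knn_surrogate(samples, k=3):
    """Build a predictor that averages the runtimes of the k nearest measured configs."""
    def predict(config):
        def sq_dist(sample):
            return sum((a - b) ** 2 for a, b in zip(sample[0], config))
        nearest = sorted(samples, key=sq_dist)[:k]
        return sum(runtime for _, runtime in nearest) / len(nearest)
    return predict

# Search space: every (tile size, unroll factor) pair.
space = list(product(range(8, 129, 8), range(1, 9)))

# Step 1: measure a subset of the space on the source machine.
source_samples = [(cfg, runtime_source(*cfg)) for cfg in space[::3]]

# Step 2: fit the surrogate on source-machine results only.
surrogate = fit_knn_surrogate(source_samples)

# Step 3: on the target machine, measure only the configs the surrogate ranks best.
budget = 12
ranked = sorted(space, key=surrogate)
measured = [(runtime_target(*cfg), cfg) for cfg in ranked[:budget]]
best_runtime, best_config = min(measured)

print(f"best of {budget} guided measurements: {best_config} -> {best_runtime:.2f}")
```

The surrogate never sees target-machine data; it only decides the order in which target configurations are measured. Any benefit therefore depends on how well the performance trends of the two machines agree.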
