Paper Information - Exploiting Performance Portability in Search Algorithms for Autotuning

Autotuning seeks the best configuration of an application by orchestrating hardware and software knobs that affect performance on a given machine. Autotuners adopt various search techniques to find the best configuration efficiently, but they often ignore lessons learned on one machine when tuning for another. We demonstrate that a surrogate model built from performance results on one machine can speed up the autotuning search by 1.6X to 130X on a variety of modern architectures.

Prasanna Balaprakash | Amit Roy | Paul D. Hovland | Stefan M. Wild
