SPAPT: Search Problems in Automatic Performance Tuning

Abstract: Automatic performance tuning of computationally intensive kernels in scientific applications is a promising approach to achieving good performance on different machines while preserving the kernel implementation's readability and portability. A major bottleneck in automatic performance tuning is the computation time required to test a large number of possible code variants, which grows exponentially with the number of tuning parameters. Consequently, the design, development, and analysis of effective search techniques capable of quickly finding high-performing parameter configurations have gained significant attention in recent years. An important element needed for this research is a collection of test problems that allow performance engineering and mathematical optimization researchers to conduct rigorous algorithmic development and experimental studies. In this paper, we describe a set of extensible and portable search problems in automatic performance tuning (SPAPT) whose goal is to aid in the development and improvement of search strategies. SPAPT is a test suite that contains representative serial code implementations from a number of lower-level performance-tuning tasks in scientific applications. We present an illustrative experimental study on several problems from the test suite. We discuss important issues such as modeling, search space characteristics, and performance objectives.
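To make the search problem concrete, the sketch below shows a minimal random search over a hypothetical tuning space of tile sizes and unroll factors. The parameter names, candidate values, and the synthetic cost function are illustrative assumptions, not taken from SPAPT; a real autotuner would compile and empirically time each code variant instead of calling a synthetic function.

```python
import random

# Hypothetical tuning space (illustrative values, not from SPAPT).
TUNING_SPACE = {
    "tile_size": [8, 16, 32, 64, 128],
    "unroll_factor": [1, 2, 4, 8],
}

def synthetic_runtime(config):
    """Stand-in for an empirical timing run: a made-up cost surface
    whose single best point is tile_size=32, unroll_factor=4."""
    return (abs(config["tile_size"] - 32) / 32.0
            + abs(config["unroll_factor"] - 4) / 4.0)

def random_search(space, budget, seed=0):
    """Evaluate `budget` random configurations and return the best
    (configuration, cost) pair found within the evaluation budget."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        t = synthetic_runtime(config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

if __name__ == "__main__":
    best, t = random_search(TUNING_SPACE, budget=20)
    print(best, t)
```

Even this toy space has 20 points; with tens of parameters the Cartesian product becomes intractable to enumerate, which is why the paper focuses on search strategies that find good configurations within a small evaluation budget.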
