Auto-tuning full applications: A case study

In this paper, we take a concrete step towards materializing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. We show that the proposed system first extracts computationally intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. We show that our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the application’s performance improves by 27%.

[1]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[2]  David A. Padua,et al.  SPL: a language and compiler for DSP algorithms , 2001, PLDI '01.

[3]  David Parello,et al.  Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies , 2006, International Journal of Parallel Programming.

[4]  Chun Chen,et al.  A Programming Language Interface to Describe Transformations and Code Generation , 2010, LCPC.

[5]  I-Hsin Chung,et al.  Active Harmony: Towards Automated Performance Tuning , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[6]  Allen D. Malony,et al.  The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..

[7]  Chun Chen,et al.  Loop Transformation Recipes for Code Generation and Auto-Tuning , 2009, LCPC.

[8]  P. Sadayappan,et al.  Annotation-based empirical performance tuning using Orio , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[9]  Bronis R. de Supinski,et al.  A ROSE-Based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries , 2010, IWOMP.

[10]  Daniel J. Quinlan,et al.  Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions , 2010, International Journal of Parallel Programming.

[11]  Michael F. P. O'Boyle,et al.  Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[12]  Vahid Tabatabaee,et al.  Parallel Parameter Tuning for Applications with Performance Variability , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[13]  Chuan Lu,et al.  Simulating subsurface flow and transport on ultrascale computers using PFLOTRAN , 2007 .

[14]  Chun Chen,et al.  Model-guided empirical optimization for memory hierarchy , 2007 .

[15]  José Nelson Amaral,et al.  Ablego: a function outlining and partial inlining framework , 2007, Softw. Pract. Exp..

[16]  Katherine Yelick,et al.  Performance Engineering: Understanding and Improving thePerformance of Large-Scale Codes , 2007 .

[17]  Ken Kennedy,et al.  Automatic tuning of whole applications using direct search and a performance-based transformation system , 2006, The Journal of Supercomputing.

[18]  I-Hsin Chung,et al.  A Case Study Using Automatic Performance Tuning for Large-Scale Scientific Programs , 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.

[19]  Albert Cohen,et al.  Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[20]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[21]  Nathan R. Tallent,et al.  HPCTOOLKIT: tools for performance analysis of optimized parallel programs , 2010, Concurr. Comput. Pract. Exp..

[22]  Chun Chen,et al.  A scalable auto-tuning framework for compiler optimization , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[23]  José Nelson Amaral,et al.  Ablego : a function outlining and partial inlining framework: Research Articles , 2007 .

[24]  Albert Cohen,et al.  Iterative optimization in the polyhedral model: part ii, multidimensional time , 2008, PLDI '08.

[25]  Samuel Williams,et al.  Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[26]  Richard W. Vuduc,et al.  POET: Parameterized Optimizations for Empirical Tuning , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[27]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[28]  Arun Lakhotia,et al.  Restructuring programs by tucking statements into functions , 1998, Inf. Softw. Technol..

[29]  Robert D. Falgout,et al.  Semicoarsening Multigrid on Distributed Memory Machines , 1999, SIAM J. Sci. Comput..

[30]  Yoon-Ju Lee,et al.  A Code Isolator: Isolating Code Fragments from Large Programs , 2004, LCPC.

[31]  Richard W. Vuduc,et al.  Effective Source-to-Source Outlining to Support Whole Program Empirical Optimization , 2009, LCPC.