Automating Compiler-Directed Autotuning for Phased Performance Behavior

We describe an integration of the CHiLL compiler with OpenTuner that reduces the programmer burden of autotuning. As a case study, we optimize the smooth operator and its associated stencil computations in Geometric Multigrid (GMG), a hierarchical linear solver that operates at multiple grid resolutions (levels). Smooth is the most performance-critical operation: it runs multiple times at each grid level and performs a relaxation of the approximate solution at that resolution. This computation poses a particular challenge for autotuning because the best optimization strategy varies across grid resolutions within a single application execution. Although the compiler provides a number of standard and domain-specific optimizations for stencil computations, it is difficult for a programmer to decide which optimizations to apply and to implement all the steps of the autotuning search. In this paper, we make the following contributions to simplify this process and to make it possible to configure the application for its different phases: (1) we provide an interface, called a superscript, that concisely describes a search space and automatically generates CHiLL transformation recipes; and (2) we tailor OpenTuner to CHiLL transformation recipes so that its heuristic search algorithms manage the computational complexity of the search. We demonstrate performance that far exceeds that of fixed optimization strategies while sampling only a tiny subset of the autotuning search space.
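The idea of tuning each grid level separately while sampling only part of the search space can be illustrated with a minimal sketch. It does not use CHiLL or OpenTuner. The parameter names (`tile`, `fuse`, `threads`), the synthetic cost model, and the uniform random sampling are all assumptions made for illustration; they stand in for the generated transformation recipes, measured run times, and search heuristics of the actual system.

```python
import itertools
import random

# Hypothetical search space: parameter names and values are illustrative only.
SEARCH_SPACE = {
    "tile": [4, 8, 16, 32, 64],
    "fuse": [False, True],
    "threads": [1, 2, 4, 8],
}


def synthetic_cost(grid_size, config):
    """Stand-in for timing smooth at one grid level (not a real measurement)."""
    tile = min(config["tile"], grid_size)
    work = grid_size ** 3 / config["threads"]
    overhead = 500.0 * config["threads"]           # threading overhead dominates on small grids
    tile_penalty = abs(tile - grid_size / 4) * 10  # preferred tile size scales with the grid
    fuse_gain = 0.8 if config["fuse"] else 1.0
    return work * fuse_gain + overhead + tile_penalty


def tune_level(grid_size, budget, rng):
    """Sample `budget` distinct configurations at random and keep the cheapest."""
    keys = list(SEARCH_SPACE)
    all_configs = [
        dict(zip(keys, values))
        for values in itertools.product(*SEARCH_SPACE.values())
    ]
    sampled = rng.sample(all_configs, budget)
    return min(sampled, key=lambda c: synthetic_cost(grid_size, c))


def tune_all_levels(grid_sizes, budget=10, seed=0):
    """Run an independent search for each grid level (each phase)."""
    rng = random.Random(seed)
    return {n: tune_level(n, budget, rng) for n in grid_sizes}


if __name__ == "__main__":
    for n, cfg in tune_all_levels([64, 32, 16, 8]).items():
        print(n, cfg)
```

Under this toy cost model, the configuration chosen for a coarse grid can differ from the one chosen for a fine grid, which is the phased behavior that motivates searching per level rather than fixing one strategy for the whole run.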
