MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

Code transformations, such as loop tiling and loop fusion, are of key importance for the efficient implementation of stencil computations. However, their direct application to a large code base is costly and severely impacts program maintainability. While recently introduced domain-specific languages facilitate the application of such transformations, they typically still require manual tuning or auto-tuning techniques to select the transformations that yield optimal performance. In this paper, we introduce MODESTO, a model-driven stencil optimization framework, that for a stencil program suggests program transformations optimized for a given target architecture. Initially, we review and categorize data locality transformations for stencil programs and introduce a stencil algebra that allows the expression and enumeration of different stencil program implementation variants. Combining this algebra with a compile-time performance model, we show how to automatically tune stencil programs. We use our framework to model the STELLA library and optimize kernels used by the COSMO atmospheric model on multi-core and hybrid CPU-GPU architectures. Compared to naive and expert-tuned variants, the automatically tuned kernels attain a 2.0-3.1x and a 1.0-1.8x speedup respectively.

[1]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[2]  Allen Taflove,et al.  Review of the formulation and applications of the finite-difference time-domain method for numerical modeling of electromagnetic wave interactions with arbitrary structures , 1988 .

[3]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[4]  Tobias Gysi,et al.  Towards a performance portable, architecture agnostic implementation strategy for weather and climate models , 2014, Supercomput. Front. Innov..

[5]  Catherine Mills Olschanowsky,et al.  A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[7]  Vincent Loechner,et al.  Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions , 2007, Algorithmica.

[8]  Frank Mueller,et al.  Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters , 2013, IEEE Transactions on Parallel and Distributed Systems.

[9]  Vivek Sarkar,et al.  Analytical Bounds for Optimal Tile Size Selection , 2012, CC.

[10]  G. McMechan MIGRATION BY EXTRAPOLATION OF TIME‐DEPENDENT BOUNDARY VALUES* , 1983 .

[11]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[12]  Xing Zhou,et al.  Hierarchical overlapped tiling , 2012, CGO '12.

[13]  Uday Bondhugula,et al.  PolyMage: Automatic Optimization for Image Processing Pipelines , 2015, ASPLOS.

[14]  Samuel Williams,et al.  Compiler generation and autotuning of communication-avoiding operators for geometric multigrid , 2013, 20th Annual International Conference on High Performance Computing.

[15]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[16]  Volker Strumpen,et al.  The memory behavior of cache oblivious stencil computations , 2007, The Journal of Supercomputing.

[17]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[18]  Mohamed Wahib,et al.  Scalable Kernel Fusion for Memory-Bound GPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[19]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  Sanjay V. Rajopadhye,et al.  Positivity, posynomials and tile size selection , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.