PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations

PATUS is a code generation and auto-tuning framework for stencil computations targeted at modern multi- and many-core processors, such as multicore CPUs and graphics processing units. Its ultimate goal is to enable both productivity and performance on current and future multi- and many-core platforms. The framework generates the code for a compute kernel from a specification of the stencil operation and a Strategy, which describes the parallelization and optimization methods to be applied. We then use auto-tuning to find the parameter configuration that performs best for the specific hardware architecture and Strategy.
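To make the two ingredients concrete, here is a minimal sketch in Python. It is not PATUS's stencil specification or Strategy language. The stencil operation is assumed to be a 2D 5-point Jacobi update. The "Strategy" is hypothetically reduced to loop (cache) blocking with tile sizes `bi` × `bj`. The auto-tuner is a plain exhaustive search that times each candidate tile size and keeps the fastest, standing in for whatever search method the framework actually uses. All function names (`update`, `naive_sweep`, `blocked_sweep`, `autotune`) are illustrative assumptions.

```python
import itertools
import time


def update(u, i, j):
    # Hypothetical stencil: 5-point Jacobi average of a point and its four axis neighbours.
    return 0.2 * (u[i][j] + u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def naive_sweep(u):
    """Reference sweep: update every interior point in plain row-major order."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]  # out-of-place; boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = update(u, i, j)
    return v


def blocked_sweep(u, bi, bj):
    """Same sweep, traversing the interior in bi x bj tiles (the tunable 'Strategy' parameters)."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]
    for ii in range(1, n - 1, bi):          # tile origins along i
        for jj in range(1, m - 1, bj):      # tile origins along j
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    v[i][j] = update(u, i, j)
    return v


def autotune(u, candidates, repeats=3):
    """Exhaustive search: time each (bi, bj) candidate and return the fastest one."""
    best, best_time = None, float("inf")
    for bi, bj in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            blocked_sweep(u, bi, bj)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = (bi, bj), elapsed
    return best, best_time


# Usage: a 32 x 32 grid and every tile size drawn from {4, 8, 16} in each dimension.
grid = [[float((7 * i + j) % 5) for j in range(32)] for i in range(32)]
candidates = list(itertools.product([4, 8, 16], repeat=2))
best, seconds = autotune(grid, candidates)
assert blocked_sweep(grid, *best) == naive_sweep(grid)  # blocking must not change results
print(f"best tile size: {best}, {seconds:.4f} s for 3 sweeps")
```

Because `blocked_sweep` evaluates the same expression for each point as `naive_sweep`, the tuned variant is guaranteed to produce identical results; only the traversal order changes. In pure Python the timings mostly reflect interpreter overhead, so the sketch shows the search mechanism rather than the cache effects that blocking targets on real hardware.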
