A Domain-Specific Language and Compiler for Stencil Computations on Short-Vector SIMD and GPU Architectures

Stencil computations are an integral part of applications in a number of scientific computing domains, such as image processing and the numerical solution of partial differential equations. We describe a domain-specific language (DSL) for regular stencil computations that allows them to be specified concisely. We also describe a multi-target compiler for this DSL that generates optimized code for GPUs and for multi-core processors with short-vector SIMD engines. The hardware differences between these two types of architecture call for different optimization strategies in the compiler: a data layout transformation combined with split tiling is used for multi-core CPUs, while overlapped tiling is used for GPUs. We evaluate our domain-specific compiler on a number of benchmarks on CPU and GPU platforms.
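
To make the idea of overlapped tiling concrete, the following is a minimal sketch in plain Python. It is not the DSL's syntax and not code produced by the compiler; the function names (`jacobi_step`, `reference`, `overlapped_tiled`), the 3-point averaging stencil, and the fixed-boundary treatment are all assumptions chosen for illustration. Each tile loads a halo as wide as the number of time steps, redundantly recomputes that halo, and writes back only its own interior, so tiles can be processed independently.

```python
def jacobi_step(u):
    """One sweep of a 3-point averaging stencil; the two end cells are held fixed."""
    n = len(u)
    out = list(u)
    for i in range(1, n - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def reference(u, steps):
    """Untiled baseline: apply the stencil to the whole array `steps` times."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def overlapped_tiled(u, steps, tile):
    """Advance `steps` sweeps one tile at a time (illustrative, sequential).

    Each tile reads a halo of `steps` extra cells on each side. Values at the
    edge of the halo become stale by one more cell per sweep, so after `steps`
    sweeps only the tile's own cells are still exact; those are written back.
    """
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # halo clipped at the global boundary
        hi = min(end + steps, n)
        local = u[lo:hi]             # tile plus halo, read from the input only
        for _ in range(steps):
            local = jacobi_step(local)
        out[start:end] = local[start - lo:end - lo]
    return out
```

Because every tile reads only the original input, the loop over tiles has no dependences between iterations, which is the property that makes this scheme a natural fit for independent GPU thread blocks; the price is the redundant work done in the halos.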
