Target-Specific Refinement of Multigrid Codes

This paper applies partial evaluation to stage a stencil code Domain-Specific Language (DSL) onto a functional and imperative programming language. Platform-specific primitives such as scheduling or vectorization, and algorithmic variants such as boundary handling are factored out into a library that make up the elements of that DSL. We show how partial evaluation can eliminate all overhead of this separation of concerns and creates code that resembles hand-crafted versions for a particular target platform. We evaluate our technique by implementing a DSL for the V-cycle multigrid iteration. Our approach generates code for AMD and NVIDIA GPUs (via SPIR and NVVM) as well as for CPUs using AVX/AVX2 alike from the same high-level DSL program. First results show that we achieve a speedup of up to 3x on the CPU by vectorizing multigrid components and a speedup of up to 2x on the GPU by merging the computation of multigrid components.

[1]  Jürgen Teich,et al.  Towards Domain-Specific Computing for Stencil Codes in HPC , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[2]  Jan Vitek,et al.  Terra: a multi-stage language for high-performance computing , 2013, PLDI.

[3]  Andreas Dedner,et al.  A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework , 2008, Computing.

[4]  Yoshihiko Futamura,et al.  Partial Evaluation of Computation Process--An Approach to a Compiler-Compiler , 1999, High. Order Symb. Comput..

[5]  Andreas Dedner,et al.  A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE , 2008, Computing.

[6]  Eric Darve,et al.  Liszt: A domain specific language for building portable mesh-based PDE solvers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[7]  採編典藏組 Society for Industrial and Applied Mathematics(SIAM) , 2008 .

[8]  Frédo Durand,et al.  Decoupling algorithms from schedules for easy optimization of image processing pipelines , 2012, ACM Trans. Graph..

[9]  Martin Odersky,et al.  Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs , 2010, GPCE '10.

[10]  Robert D. Falgout,et al.  Scaling Hypre's Multigrid Solvers to 100, 000 Cores , 2011, High-Performance Scientific Computing.

[11]  Thomas Johnsson,et al.  Lambda Lifting: Treansforming Programs to Recursive Equations , 1985, FPCA.

[12]  Samuel Williams,et al.  An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[13]  Martin Odersky,et al.  Spiral in scala: towards the systematic construction of generators for performance libraries , 2014, GPCE '13.

[14]  Philipp Slusallek,et al.  Code Refinement of Stencil Codes , 2014, Parallel Process. Lett..

[15]  Harald Köstler,et al.  Performance engineering to achieve real-time high dynamic range imaging , 2012, Journal of Real-Time Image Processing.

[16]  Sebastian Hack,et al.  Whole-function vectorization , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[17]  William L. Briggs,et al.  A multigrid tutorial, Second Edition , 2000 .

[18]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[19]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.