Paper information - Target-Specific Refinement of Multigrid Codes

This paper applies partial evaluation to stage a Domain-Specific Language (DSL) for stencil codes onto a functional and imperative programming language. Platform-specific primitives, such as scheduling or vectorization, and algorithmic variants, such as boundary handling, are factored out into a library whose functions make up the elements of that DSL. We show how partial evaluation can eliminate all overhead of this separation of concerns and create code that resembles hand-crafted versions for a particular target platform. We evaluate our technique by implementing a DSL for the V-cycle multigrid iteration. Our approach generates code for AMD and NVIDIA GPUs (via SPIR and NVVM) as well as for CPUs using AVX/AVX2, all from the same high-level DSL program. First results show a speedup of up to 3x on the CPU from vectorizing multigrid components and a speedup of up to 2x on the GPU from merging the computation of multigrid components.
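To make the idea of removing abstraction overhead concrete, here is a minimal Python sketch. It is not the paper's implementation and does not use a real partial evaluator: a hand-written specializer stands in for one, and all names (`clamp`, `apply_stencil_generic`, `specialize`) are hypothetical. The generic version keeps the stencil weights and the boundary handling as separate library pieces. The specialized version is a generated residual program in which the loop over the weights is unrolled and the boundary handling is inlined.

```python
# Illustrative sketch only: a hand-written specializer that mimics what
# partial evaluation would produce. All names here are hypothetical.

def clamp(i, n):
    """Boundary-handling variant: clamp an index into [0, n - 1]."""
    return max(0, min(i, n - 1))


def apply_stencil_generic(data, weights, boundary):
    """Generic 1D stencil. `weights` maps offset -> coefficient."""
    n = len(data)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        # This inner loop and the call to `boundary` are the abstraction
        # overhead that specialization is meant to remove.
        for off, w in weights.items():
            acc += w * data[boundary(i + off, n)]
        out[i] = acc
    return out


def specialize(weights):
    """Generate a residual program for fixed `weights`.

    The loop over the weights is unrolled and the clamp boundary
    handling is inlined into the generated source.
    """
    terms = " + ".join(
        f"{w!r} * data[max(0, min(i + ({off}), n - 1))]"
        for off, w in weights.items()
    )
    src = (
        "def apply_stencil_specialized(data):\n"
        "    n = len(data)\n"
        f"    return [{terms} for i in range(n)]\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated residual program
    return namespace["apply_stencil_specialized"], src


if __name__ == "__main__":
    weights = {-1: 0.25, 0: 0.5, 1: 0.25}
    specialized, source = specialize(weights)
    print(source)
    print(specialized([1.0, 2.0, 4.0, 8.0]))
```

The point of the sketch is only that the DSL program (the weights plus the chosen boundary variant) is known before run time, so the generated function contains no trace of the library structure. A real partial evaluator would derive such residual code automatically from the generic definition.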
