A Modular Approach to Performance, Portability and Productivity for 3D Wave Models

The HPC hardware landscape is growing increasingly complex in order to meet the demand in scientific computing for greater performance. Recent years have seen an explosion of parallel devices, with GPUs, Xeon Phis and FPGAs to name but a few examples. At the time of writing, even the leading supercomputer, Sunway TaihuLight, uses its own bespoke on-chip accelerators[12]. Available programming models, however, lag behind and do not yet provide the tools needed to run scientific codes across platforms in ways that are performant, portable and productive. This environment creates a plethora of challenges for computational scientists, of which we focus on two: first, the need for a high level of productivity in codes that still achieve good performance; and second, the need to achieve good performance consistently across platforms, the “performance portability” problem. Existing solutions tend either to provide good performance at the cost of productivity, or to focus on high-level abstractions that require heuristics (which are often tied to particular platforms) to achieve good performance. While some current approaches raise the productivity level, they often solve the same problems over and over, or try to solve too many issues for a niche domain. In addition, many of these approaches have only been tested on simplistic benchmarks, which can omit critical functionality of real-world simulation codes. We instead propose a modular approach that uses existing frameworks to target these issues separately: a high-level DSL addresses the productivity problem and compiles into an intermediate representation (IR) language, which addresses the performance portability problem. Our previous research has shown that it is possible to develop more productive and performance portable codes for room acoustics simulations. Preliminary results using the intermediate parallel language lift[16] confirm that this framework is capable of handling complex stencils.
Further developing lift and targeting existing stencil-focused DSLs will create a simple, modularized approach which harnesses and expands existing functionality instead of reinventing the wheel. This modular approach can then serve as an example for other physical simulations that use similar algorithms.

WOLFHPC, November 2017, Denver, USA

[1]  D. Botteldooren.  Finite‐difference time‐domain simulation of low‐frequency room acoustic problems, 1995.

[2]  Daniel Sunderland, et al.  Kokkos: Enabling manycore performance portability through polymorphic memory access patterns, 2014, J. Parallel Distributed Comput.

[3]  Sergei Gorlatch, et al.  High-Level Programming of Stencil Computations on Multi-GPU Systems Using the SkelCL Library, 2014, Parallel Process. Lett.

[4]  Pradeep Dubey, et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[5]  Craig J. Webb.  Parallel computation techniques for virtual acoustics and physical modelling synthesis, 2014.

[6]  Sam Lindley, et al.  Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code, 2015, ICFP.

[7]  Jürgen Teich, et al.  ExaStencils: Advanced Stencil-Code Engineering, 2014, Euro-Par Workshops.

[8]  Frédo Durand, et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI 2013.

[9]  Michel Steuwer, et al.  LIFT: A functional data-parallel IR for high-performance GPU code generation, 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[10]  Stefan Bilbao, et al.  Performance portability for room acoustics simulations, 2017.

[11]  Kunle Olukotun, et al.  Delite, 2014, ACM Trans. Embed. Comput. Syst.

[12]  Paul Graham, et al.  Large Scale Physical Modeling Sound Synthesis, 2013.

[13]  Antonios Giannopoulos, et al.  Modelling ground penetrating radar by GprMax, 2005.

[14]  Alan Edelman, et al.  PetaBricks: a language and compiler for algorithmic choice, 2009, PLDI '09.

[15]  Luís Fabrício Wanderley Góes, et al.  PSkel: A stencil programming framework for CPU‐GPU systems, 2015, Concurr. Comput. Pract. Exp.

[16]  Murray Cole, et al.  Algorithmic Skeletons: Structured Management of Parallel Computation, 1989.

[17]  Christoph W. Kessler, et al.  SkePU: a multi-backend skeleton programming library for multi-GPU systems, 2010, HLPP '10.

[18]  Bradley C. Kuszmaul, et al.  The pochoir stencil compiler, 2011, SPAA '11.

[19]  Kevin Stratford, et al.  targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance, 2014, 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS).

[20]  Ronan Keryell, et al.  Khronos SYCL for OpenCL: a tutorial, 2015, IWOCL.