A Multi-level Optimization Strategy to Improve the Performance of Stencil Computation

Abstract: Stencil computation is an important numerical kernel in scientific computing. Exploiting multi-core or many-core parallelism to optimize such kernels is a major challenge because of their high memory-bandwidth demand and low arithmetic intensity. The complexity of current architectures makes the situation worse, since several mechanisms (cache memory, vectorization, compilation) can each affect performance. In this paper, we describe a multi-level optimization strategy that combines manual vectorization, space tiling and stencil composition. A major part of this study is a comparison of our results with the Pochoir framework. We evaluate our methodology with three different compilers (Intel, Clang and GCC) on two recent generations of Intel multi-core platforms. Our results closely match the theoretical performance models (i.e., roofline models). We also outperform Pochoir by a factor of 2.5 in the best case.
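To make the space-tiling ingredient concrete, the following is a minimal sketch in C and not the authors' implementation. It assumes a 2D 5-point Jacobi stencil on a row-major n x n grid; the function names `stencil_naive` and `stencil_tiled` and the `tile` parameter are hypothetical and chosen only for illustration. Manual vectorization and stencil composition are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sweep: one Jacobi update of a 5-point stencil over the
 * interior of an n x n row-major grid. Boundary cells are left untouched. */
void stencil_naive(const double *in, double *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        for (size_t j = 1; j + 1 < n; ++j)
            out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                     in[i * n + (j - 1)] + in[i * n + (j + 1)]);
}

/* Space-tiled sweep: the interior is traversed in tile x tile blocks so that
 * the rows touched by one block are more likely to stay in cache.
 * `tile` is an illustrative tuning parameter (assumption, not from the paper). */
void stencil_tiled(const double *in, double *out, size_t n, size_t tile) {
    assert(tile > 0);
    for (size_t ii = 1; ii + 1 < n; ii += tile) {
        size_t i_end = (ii + tile < n - 1) ? ii + tile : n - 1;
        for (size_t jj = 1; jj + 1 < n; jj += tile) {
            size_t j_end = (jj + tile < n - 1) ? jj + tile : n - 1;
            for (size_t i = ii; i < i_end; ++i)
                /* Unit-stride inner loop: a candidate for SIMD vectorization. */
                for (size_t j = jj; j < j_end; ++j)
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                             in[i * n + (j - 1)] + in[i * n + (j + 1)]);
        }
    }
}
```

Both functions perform the same arithmetic in the same order for each point, so they should produce identical output; only the traversal order differs. In practice the tile size would be tuned to the cache hierarchy of the target platform.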
