Algorithm 942

Finite Difference (FD) is a widely used method for solving Partial Differential Equations (PDEs). PDEs are at the core of many simulations in scientific fields such as geophysics and astrophysics. A typical FD solver performs stencil computations over the entire computational domain, thereby evaluating the differential operators. In general terms, a stencil computation is a weighted accumulation of the contributions of neighboring points along the Cartesian axes. Optimizing stencil computations is therefore crucial to reducing application execution time. Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles both problems. The main idea behind the algorithm is to change the way in which the stencil computation progresses through the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Each update gathers half of the axis neighbors and, at the same time, partially computes the stencil at a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations are completed on these precomputed points. This computation strategy improves the memory access pattern and efficiently reuses the accessed data. Our initial target architecture was the Cell/B.E., where Semi-stencil on an SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with the space- and time-blocking algorithms used on hierarchical memory architectures.
Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms are used as testbeds, where the best improvements for a 25-point stencil range from 1.27× to 1.76×. The results show that this novel strategy is a feasible optimization method that may be integrated into auto-tuning frameworks. In addition, since all current architectures are multicore-based, we include a brief section presenting scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems. In a nutshell, the algorithm scales as well as or better than other stencil techniques. For instance, the scalability of Semi-stencil on MIC for one test case reached 93.8× with 244 threads.
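
The split into partial updates described above can be illustrated with a minimal one-dimensional sketch. This is not the published implementation: the function names `classical_stencil` and `semi_stencil`, the symmetric coefficient layout `c[0..l]`, and the restriction to a single axis are all assumptions made for illustration (a star-shaped 25-point stencil would correspond to radius `l = 4` along each of three axes). The sketch assumes a forward update that starts a point with its centre and left-side neighbors, and a backward update that later completes it with its right-side neighbors.

```python
def classical_stencil(x, c):
    """Reference 1D stencil: c[0] weights the centre, c[r] weights both
    neighbors at distance r. Boundary points are left at 0.0."""
    l, n = len(c) - 1, len(x)
    out = [0.0] * n
    for i in range(l, n - l):
        acc = c[0] * x[i]
        for r in range(1, l + 1):
            acc += c[r] * (x[i - r] + x[i + r])
        out[i] = acc
    return out


def semi_stencil(x, c):
    """Illustrative semi-stencil (hypothetical helper, 1D only).
    Each main-loop iteration touches only the window x[i .. i+l]
    (l + 1 values) instead of the 2l + 1 values of the classical form."""
    l, n = len(c) - 1, len(x)
    out = [0.0] * n

    def left_half(j):
        # Centre plus left-side neighbors of point j (the partial result).
        acc = c[0] * x[j]
        for r in range(1, l + 1):
            acc += c[r] * x[j - r]
        return acc

    # Head: the first l interior points get their partial result up front,
    # because no earlier iteration exists to perform their forward update.
    for j in range(l, min(2 * l, n - l)):
        out[j] = left_half(j)

    for i in range(l, n - l):
        # Forward update: start point i + l using x[i .. i+l].
        j = i + l
        if j < n - l:
            out[j] = left_half(j)
        # Backward update: complete point i with its right-side neighbors
        # x[i+1 .. i+l], which the forward update has just read.
        acc = out[i]
        for r in range(1, l + 1):
            acc += c[r] * x[i + r]
        out[i] = acc
    return out
```

Calling both functions on the same input, for example `semi_stencil(x, [-2.5, 1.0, 0.5, 0.25, 0.125])`, should give the same interior values as `classical_stencil` up to floating-point rounding, since the additions are merely performed in a different order. The trade-off in this sketch is one extra read and write of `out` per point in exchange for a narrower window of input accesses.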
