Towards a Multi-Level Cache Performance Model for 3D Stencil Computation

Abstract Stencil computations are the core, and the most computationally demanding segment, of many Scientific Computing applications, so optimizing them is crucial to reducing overall execution time. This is not a simple task; it is both lengthy and tedious. It is lengthy because of the large number of combinations of stencil optimizations to test, which might consume days of computing time, and tedious because of the many slightly different versions of the code that must be implemented. Alternatively, models that predict performance can be built without any actual stencil execution, thus reducing the cumbersome optimization task. Previous works have proposed cache-miss and execution-time models for specific stencil optimizations. However, most of them have been designed for 2D datasets or for stencil sizes that only suit low-order numerical schemes. We propose a flexible and accurate model that covers a wide range of stencil sizes, up to high-order schemes, and captures the behavior of 3D stencil computations using platform parameters. The model has been tested on a group of representative hardware architectures, using realistic dataset sizes. Our model successfully predicts stencil execution times and cache misses. Prediction accuracy depends on the platform; for instance, on x86 architectures prediction errors range between 1% and 20%. The model is therefore reliable and can help to speed up the stencil computation optimization process. To that end, other stencil optimization techniques can be added to this model, thus essentially providing a framework that covers most of the state of the art.
