Leveraging performance of 3D finite difference schemes in large scientific computing simulations

Gone are the days when engineers and scientists conducted most of their experiments empirically. In those decades, physical tests were carried out to assess the robustness and reliability of forthcoming product designs and to validate theoretical models. With the advent of the computational era, scientific computing has definitely become a feasible alternative to empirical methods in terms of effort, cost and reliability. Large, massively parallel computational resources have reduced simulation execution times and improved numerical results by allowing a finer sampling of the domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as Finite Element (FE) and Finite Volume (FV) are especially well suited to problems where unstructured meshes are frequent. Unfortunately, this flexibility does not come for free: these schemes incur higher memory latencies because of their irregular data accesses. Conversely, the Finite Difference (FD) scheme has proven to be an efficient solution for problems where structured meshes suit the domain requirements, and many scientific areas use it because of its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed, such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio of medium- and high-order stencil operators by reducing memory accesses and promoting data reuse. The algorithm is orthogonal to, and can be combined with, techniques such as spatial- or time-blocking, adding further improvement. New trends in Symmetric Multi-Processing (SMP) systems, where tens of cores are replicated on the same die, pose new challenges because they exacerbate the memory wall problem.
To alleviate this issue, our research focuses on different strategies to reduce pressure on the cache hierarchy, particularly when different threads share resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for workload balance are introduced, ensuring quasi-optimal results without jeopardizing overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache. As an alternative to the brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a near-optimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting near-optimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model supports multi- and many-core architectures, including complex features such as hardware prefetchers, SMT contexts and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects that arise in industry are still scarcely considered in academic research. In this regard, we have collaborated in the implementation of an FD framework that covers the most important features an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included in the framework in order to contribute to overall application performance.
We show results for two strategic industrial applications: an atmospheric transport model that simulates the dispersal of volcanic ash, and a seismic imaging model used in the Oil & Gas industry to identify hydrocarbon-rich reservoirs.
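
As a rough illustration of the kind of kernel discussed above, the following minimal Python sketch applies a second-order 7-point FD Laplacian on a structured 3D grid, first with a plain sweep and then with spatial blocking in the two fastest dimensions. It is not the thesis implementation and not the Semi-stencil algorithm; the function names, the block sizes `by` and `bx`, and the test field are all assumptions made for illustration. Pure Python lists will not show any cache benefit, so the sketch only conveys the loop structure and checks that blocking leaves the result unchanged.

```python
def make_grid(nz, ny, nx):
    """Hypothetical test field u = x^2 + 2*y^2 + 3*z^2 on a unit-spaced grid."""
    return [[[float(x * x + 2 * y * y + 3 * z * z) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]


def laplacian_point(u, z, y, x):
    """Second-order 7-point finite-difference Laplacian at one interior point."""
    return (u[z][y][x - 1] + u[z][y][x + 1]
            + u[z][y - 1][x] + u[z][y + 1][x]
            + u[z - 1][y][x] + u[z + 1][y][x]
            - 6.0 * u[z][y][x])


def stencil_naive(u):
    """Plain sweep over all interior points; boundary points stay at 0.0."""
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[z][y][x] = laplacian_point(u, z, y, x)
    return out


def stencil_blocked(u, by=4, bx=4):
    """Same sweep, tiled in y and x while streaming along z.

    Each (by x bx) tile is traversed through the whole z range before
    moving to the next tile, so the planes it touches stay small.
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for yy in range(1, ny - 1, by):
        y_end = min(yy + by, ny - 1)          # clip the last tile in y
        for xx in range(1, nx - 1, bx):
            x_end = min(xx + bx, nx - 1)      # clip the last tile in x
            for z in range(1, nz - 1):
                for y in range(yy, y_end):
                    for x in range(xx, x_end):
                        out[z][y][x] = laplacian_point(u, z, y, x)
    return out


u = make_grid(6, 9, 10)
naive = stencil_naive(u)
blocked = stencil_blocked(u, by=4, bx=3)
```

For the quadratic test field chosen here, the discrete Laplacian equals 2 + 4 + 6 = 12 at every interior point, which gives a simple correctness check for both sweeps. In a compiled implementation the block sizes would be the parameters that an auto-tuner or a performance model selects.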
