Asynchronous computations for solving the acoustic wave propagation equation

The aim of this study is to design and implement an asynchronous computational scheme for solving the acoustic wave propagation equation with absorbing boundary conditions (ABCs) in the context of seismic imaging applications. While the convolutional perfectly matched layer (CPML) is typically used for ABCs in the oil and gas industry, its formulation puts additional pressure on memory accesses and lowers the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time-stepping scheme make asynchronous time integration cumbersome, and (2) the load imbalance introduced among processing units further increases idle time. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder the parallel performance of the overall asynchronous time integration. In particular, when CPML is deployed in conjunction with the multicore-optimized wavefront diamond temporal blocking (MWD-TB) approach for the inner domain points, it causes a major performance slowdown. To relax CPML's synchrony and mitigate the resulting load imbalance, we embed CPML's calculation into MWD-TB's inner loop and carry out the time integration with fine-grained computations in an asynchronous, holistic way. This comes at the price of storing transient results, which removes the dependencies behind critical data hazards while maintaining the numerical accuracy of the original scheme. Performance and scalability results on several x86 architectures and a range of grid sizes demonstrate the superiority of MWD-TB with CPML support over standard spatial blocking. To our knowledge, this is the first practical study that highlights the consolidation of CPML ABCs with asynchronous temporal blocking stencil computations.
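The idea of fusing the boundary treatment into a temporally blocked inner loop can be illustrated with a deliberately simplified 1D sketch. It is not the scheme of this study: it assumes a plain quadratic damping sponge in place of CPML, parallelogram time skewing in place of MWD-TB, serial execution, and a full space-time array so that every transient time level is kept. All names (`make_sigma`, `update`, `sweep_naive`, `sweep_skewed`) and parameter values are hypothetical. The point is only that, once the damping term lives inside the per-point update, a skewed tile traversal respects the same data dependencies as the plain time-step-by-time-step sweep.

```python
import numpy as np

def make_sigma(nx, width, sigma_max):
    """Damping profile (already scaled by dt); nonzero only near both ends.
    Assumption: a simple quadratic sponge, NOT a CPML formulation."""
    sigma = np.zeros(nx)
    ramp = (np.arange(width, 0, -1) / width) ** 2 * sigma_max
    sigma[:width] = ramp
    sigma[nx - width:] = ramp[::-1]
    return sigma

def initial_field(nx, nt):
    """Space-time array u[t, x] with a Gaussian pulse at time levels 0 and 1."""
    u = np.zeros((nt + 1, nx))
    x = np.arange(nx)
    u[0] = np.exp(-0.05 * (x - nx // 2) ** 2)
    u[0, [0, -1]] = 0.0          # fixed end points
    u[1] = u[0]
    return u

def update(u, t, x, c2, sigma_dt):
    """Leapfrog value for u[t+1, x] of u_tt + sigma*u_t = c^2*u_xx.
    The damping term is fused into the same per-point update."""
    lap = u[t, x - 1] - 2.0 * u[t, x] + u[t, x + 1]
    a = 0.5 * sigma_dt[x]
    return (2.0 * u[t, x] - (1.0 - a) * u[t - 1, x] + c2 * lap) / (1.0 + a)

def sweep_naive(u, c2, sigma_dt):
    """Reference traversal: finish each time level before starting the next."""
    nt, nx = u.shape[0] - 1, u.shape[1]
    for t in range(1, nt):
        for x in range(1, nx - 1):
            u[t + 1, x] = update(u, t, x, c2, sigma_dt)
    return u

def sweep_skewed(u, c2, sigma_dt, tile):
    """Time-skewed traversal: tiles in s = x + t, several time levels per tile.
    Every input of a point has a smaller t and an s no larger than the point's,
    so ascending tiles with ascending t inside each tile is a valid order."""
    nt, nx = u.shape[0] - 1, u.shape[1]
    s_max = (nx - 2) + (nt - 1)
    for s0 in range(2, s_max + 1, tile):
        for t in range(1, nt):
            lo = max(s0 - t, 1)
            hi = min(s0 + tile - t, nx - 1)   # exclusive upper bound
            for x in range(lo, hi):
                u[t + 1, x] = update(u, t, x, c2, sigma_dt)
    return u

nx, nt, c2 = 64, 40, 0.25                      # c2 = (c*dt/dx)^2, illustrative
sigma_dt = make_sigma(nx, width=8, sigma_max=0.3)
ref = sweep_naive(initial_field(nx, nt), c2, sigma_dt)
blk = sweep_skewed(initial_field(nx, nt), c2, sigma_dt, tile=8)
print("max difference:", np.max(np.abs(ref - blk)))
```

Keeping the whole space-time array makes the dependency argument easy to check but saves no memory; a practical temporal blocking code keeps only the time levels that a tile still needs, which is where the extra transient storage mentioned in the abstract comes in.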
