Asynchronous Task-Based Execution of the Reverse Time Migration for the Oil and Gas Industry

We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software infrastructure engine, TB-RTM, relies on the StarPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with the challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy, with compute kernels running at the highest level of the memory system, possibly in GPU main memory, while I/O kernels save solution data to fast storage. We consider how to span the wide performance gap between these two extreme ends of the memory system, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework leverages the new out-of-core (OOC) feature from StarPU to prefetch solution data not only to and from GPU/CPU main memory but also to and from the fast storage system. The OOC technique may create opportunities for overlapping expensive data movement with computations. The TB-RTM framework addresses this challenging problem of heterogeneity with a systematic approach that is oblivious to the targeted hardware architectures. The resulting RTM framework can be deployed effectively on massively parallel GPU-based systems, delivering performance scalability up to 500 GPUs.
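The overlap of data movement with computation described above can be illustrated with a small sketch. It is not TB-RTM and does not use the StarPU API: a single Python I/O thread stands in for the runtime's asynchronous out-of-core transfers, a toy 1D three-point stencil stands in for the wave-propagation kernel, and a sum of squares stands in for the imaging condition. The names `step`, `forward`, `backward`, and `run`, as well as the snapshot file layout, are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def step(u):
    """Toy periodic three-point stencil (stand-in for the RTM compute kernel)."""
    n = len(u)
    return [(u[i - 1] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]


def save(path, u):
    """Write one wavefield snapshot to storage."""
    with open(path, "wb") as f:
        pickle.dump(u, f)


def load(path):
    """Read one wavefield snapshot back from storage."""
    with open(path, "rb") as f:
        return pickle.load(f)


def snap_path(workdir, t):
    return os.path.join(workdir, f"snap_{t}.pkl")


def forward(u, nsteps, workdir, io):
    """Forward phase: each snapshot write is handed to the I/O thread,
    so it can proceed while the next time step is being computed."""
    pending = []
    for t in range(nsteps):
        u = step(u)
        pending.append(io.submit(save, snap_path(workdir, t), u))
    for fut in pending:
        fut.result()  # wait for all writes and surface any I/O error
    return u


def backward(nsteps, workdir, io):
    """Backward phase: snapshots are consumed in reverse order, and the
    next one is prefetched while the current one is processed."""
    if nsteps == 0:
        return 0.0
    image = 0.0
    nxt = io.submit(load, snap_path(workdir, nsteps - 1))
    for t in range(nsteps - 1, -1, -1):
        snap = nxt.result()
        if t > 0:
            nxt = io.submit(load, snap_path(workdir, t - 1))
        image += sum(v * v for v in snap)  # stand-in for the imaging condition
    return image


def run(nsteps=8, n=64):
    """Run both phases on a unit impulse and return the toy image value."""
    u0 = [0.0] * n
    u0[n // 2] = 1.0
    with tempfile.TemporaryDirectory() as workdir, \
            ThreadPoolExecutor(max_workers=1) as io:
        forward(u0, nsteps, workdir, io)
        return backward(nsteps, workdir, io)
```

Calling `run()` executes the forward and backward phases and returns the accumulated toy image value. In this sketch the program issues the asynchronous writes and prefetches by hand, whereas the abstract assigns that orchestration to the StarPU runtime.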
