Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study. (Apports des architectures hybrides à l'imagerie profondeur : étude comparative entre CPU, APU et GPU)

In an exploration context, Oil and Gas (O&G) companies rely on HPC to accelerate depth imaging algorithms. Solutions based on CPU clusters and hardware accelerators are widely embraced by the industry. Graphics Processing Units (GPUs), with their huge compute power and high memory bandwidth, have attracted significant interest. However, deploying heavy imaging workflows, the Reverse Time Migration (RTM) being the most famous, on such hardware has suffered from a few limitations: limited memory capacity, frequent CPU-GPU communications that may be bottlenecked by the PCI transfer rate, and high power consumption. Recently, AMD has launched the Accelerated Processing Unit (APU): a processor that merges a CPU and a GPU on the same die, with promising features, notably a unified CPU-GPU memory. Throughout this thesis, we explore how efficiently the APU technology can be applied in an O&G context, and study whether it can overcome the limitations that characterize the CPU and GPU based solutions. The APU is evaluated with the help of memory, applicative and power efficiency OpenCL benchmarks. The feasibility of the hybrid utilization of the APUs is surveyed. The efficiency of a directive based approach is also investigated. By means of a thorough review of a selection of seismic applications (modeling and RTM) at the node level and at large scale, a comparative study between the CPU, the APU and the GPU is conducted. We show the relevance of overlapping I/O and MPI communications with computations for the APU and GPU clusters, that APUs deliver performance that ranges between that of CPUs and that of GPUs, and that the APU can be as power efficient as the GPU.
