Preparing depth imaging applications for Exascale challenges and impacts. (Study of the suitability of Exascale machines for algorithms implementing the Reverse Time Migration method)

With Exascale systems expected in the 2018-2020 time frame, performance analysis and characterization of applications on new processor architectures and large-scale systems are important tasks: they make it possible to anticipate the changes required to exploit future HPC systems efficiently. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM). My first contribution consists in characterizing and modeling the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance, as well as the interaction between the architecture and the application. The second contribution is an analysis that identifies the challenges of a hybrid and heterogeneous implementation of FDTD for manycore architectures. The target is Intel’s first Xeon Phi co-processor, Knights Corner. This architecture is an interesting proxy for the study since it exhibits some of the expected features of an Exascale system: concurrency and heterogeneity. My third contribution extends the performance analysis and modeling to the full RTM, which adds communications and I/O to the computation part. RTM is a data-intensive application and requires storing intermediate values of the computational field, resulting in expensive I/O accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system. This was done on Stampede, a machine at the Texas Advanced Computing Center (TACC), which allowed scalability to be tested on up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core CPUs, for a total of close to 5000 heterogeneous cores.
