Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Common Midpoint (CMP) and Common Reflection Surface (CRS) are widely used methods for improving the signal-to-noise ratio in seismic processing. Both methods are computationally intensive and require high-performance computing. This article optimizes them for the Sunway many-core architecture and implements large-scale seismic processing on the Sunway TaihuLight supercomputer. We propose three optimization techniques: 1) a software cache that reduces the overhead of memory accesses and shares data among CPEs via register communication; 2) a redesigned semblance calculation procedure that further reduces the overhead of memory accesses; 3) a vectorization method that improves performance when processing small volumes of data within short loops. The experimental results show that our implementations of the CMP and CRS methods on Sunway achieve average speedups of 3.50× and 3.01×, respectively, over state-of-the-art CPU implementations. In addition, our implementation is capable of running on more than one million cores of Sunway TaihuLight with good scalability.
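For readers unfamiliar with the semblance calculation mentioned above, the following is a minimal sketch of the standard textbook semblance coefficient over a window of traces, written in plain Python. It is not the Sunway implementation described in this article; the function name `semblance` and the list-of-lists input layout are assumptions made for illustration only.

```python
def semblance(window):
    """Textbook semblance of a trace window (illustrative sketch only).

    `window` is a hypothetical layout: a list of traces, each a list of
    samples covering the same time window.
    Returns (sum_t (sum_i a_i(t))^2) / (N * sum_t sum_i a_i(t)^2).
    """
    n_traces = len(window)
    n_samples = len(window[0])
    numerator = 0.0    # energy of the stacked trace
    denominator = 0.0  # total energy of the individual traces
    for t in range(n_samples):
        stacked = 0.0
        energy = 0.0
        for trace in window:
            a = trace[t]
            stacked += a
            energy += a * a
        numerator += stacked * stacked
        denominator += energy
    if denominator == 0.0:
        return 0.0  # all-zero window: define semblance as 0
    return numerator / (n_traces * denominator)
```

Perfectly coherent traces give a value of 1, while traces that cancel each other give 0. Each sample is read once and contributes to both sums, which suggests why the memory access pattern of this loop matters for performance.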
