Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax–Wendroff correction stencil

Wave propagation forward modeling is a widely used computational method in oil and gas exploration, and the iterative stencil loops at its core have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which limits both performance and power efficiency. In this paper, we accelerate the forward-modeling technique on recent multi-core and many-core architectures: Intel® Sandy Bridge CPUs, NVIDIA Fermi C2070 GPUs, NVIDIA Kepler K20x GPUs, and the Intel® Xeon Phi (MIC) co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For the Sandy Bridge CPUs and the MIC, we also employ various optimization techniques to achieve the best performance. Our stencil, with 114 component variables, poses several challenges for performance optimization, and its low ratio of computation to memory access makes it difficult to fully exploit the evaluated architectures. Nevertheless, we achieve performance efficiencies ranging from 4.730% to 20.02% of the theoretical peak. We also conduct a cross-platform performance and power analysis (focusing on the Kepler GPU and the MIC); the results could offer insights for users selecting the most suitable accelerators for their target applications.
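To make the terms "iterative stencil loop" and "ratio of computation to memory access" concrete, the following is a minimal sketch in plain Python. It assumes a generic 7-point averaging stencil on a small cubic grid, not the Lax–Wendroff correction stencil evaluated in the paper. The names `stencil_step`, `run`, and `arithmetic_intensity` are hypothetical, and the intensity estimate assumes single-precision values with no cache reuse.

```python
def stencil_step(u, n):
    """One sweep of a 7-point averaging stencil over the interior of an n*n*n grid.

    u is a flat list of length n**3; boundary values are copied unchanged.
    Illustrative stencil only, not the paper's Lax-Wendroff kernel.
    """
    def idx(i, j, k):
        return (i * n + j) * n + k

    out = list(u)  # separate output buffer, so reads always see the old time step
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[idx(i, j, k)] = (
                    u[idx(i, j, k)]
                    + u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                    + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                    + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                ) / 7.0
    return out


def run(u, n, steps):
    """Iterative stencil loop: apply the same sweep once per time step."""
    for _ in range(steps):
        u = stencil_step(u, n)
    return u


def arithmetic_intensity(flops_per_point, values_per_point, bytes_per_value=4):
    """Flops per byte moved, assuming every value travels to or from memory once."""
    return flops_per_point / (values_per_point * bytes_per_value)


if __name__ == "__main__":
    n = 8
    field = [0.0] * (n ** 3)
    field[(4 * n + 4) * n + 4] = 1.0  # point source near the grid centre
    field = run(field, n, steps=3)
    # 6 additions + 1 division per point; 7 reads + 1 write per point.
    print(arithmetic_intensity(flops_per_point=7, values_per_point=8))
```

Under these assumptions the sketch performs well under one floating-point operation per byte moved, which illustrates why stencil kernels tend to be limited by memory bandwidth rather than by peak arithmetic throughput.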
