Paper Information - Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation

This study proposes a fast low-order finite element solver for crustal deformation computations by applying Tensor Core, AI-specific hardware on a Volta GPU. Tensor Core can compute large matrix-matrix multiplications rapidly in half precision. We redesign a state-of-the-art solver algorithm so that lower-precision data types can be used and memory access costs can be reduced even when we use small matrices. With the proposed solver, we solved 13 billion degrees-of-freedom two-layered problems that mimicked the Earth's crust and mantle using 36 compute nodes of Summit. In the matrix-vector kernel, we obtained a 4.1-fold speedup over a standard kernel in a single-precision format. Our proposed solver increased the FLOP count of the entire solver; however, we reduced the time-to-solution by 1.7-fold since the Tensor Core provided a high effective performance.
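The abstract's central idea, feeding half-precision data into small matrix-matrix multiplications while keeping the result accurate enough for a solver, can be illustrated with a minimal NumPy sketch. This is a CPU emulation under stated assumptions, not the authors' kernel and not actual Tensor Core code: the function name `batched_small_matmul_fp16_in_fp32_acc`, the batch size, and the 16×16 matrix size are all hypothetical choices made for illustration. It rounds the inputs to half precision, accumulates the products in single precision, and compares the result with a double-precision reference.

```python
import numpy as np


def batched_small_matmul_fp16_in_fp32_acc(a, b):
    """Multiply batches of small matrices using FP16 inputs and FP32 accumulation.

    Hypothetical helper: emulates on the CPU the numerical behaviour of
    half-precision inputs with single-precision accumulation.
    """
    # Round the inputs to half precision (this is where precision is lost).
    a16 = a.astype(np.float16)
    b16 = b.astype(np.float16)
    # Upcast the rounded values so products and sums are carried in FP32.
    return np.matmul(a16.astype(np.float32), b16.astype(np.float32))


rng = np.random.default_rng(0)
batch, n = 1024, 16  # assumed sizes, chosen only for illustration

a = rng.uniform(-1.0, 1.0, size=(batch, n, n)).astype(np.float32)
b = rng.uniform(-1.0, 1.0, size=(batch, n, n)).astype(np.float32)

# Double-precision reference for measuring the rounding error.
ref = np.matmul(a.astype(np.float64), b.astype(np.float64))
mixed = batched_small_matmul_fp16_in_fp32_acc(a, b)

# Largest absolute error, scaled by the largest reference entry.
rel_err = np.abs(mixed - ref).max() / np.abs(ref).max()
print(f"scaled max error with FP16 inputs: {rel_err:.2e}")
```

The error printed here comes only from rounding the inputs to half precision. Whether that error is tolerable depends on how the surrounding solver uses the low-precision result, which is why the abstract describes redesigning the solver algorithm rather than simply swapping data types.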
