Low-Order Finite Element Solver with Small Matrix-Matrix Multiplication Accelerated by AI-Specific Hardware for Crustal Deformation Computation

This study proposes a fast low-order finite element solver for crustal deformation computation that uses Tensor Core, the AI-specific hardware on Volta GPUs. Tensor Core can compute large matrix-matrix multiplications rapidly in half precision. We redesign a state-of-the-art solver algorithm so that lower-precision data types can be used and memory access costs are reduced even when the matrices are small. With the proposed solver, we solved two-layered problems with 13 billion degrees of freedom that mimic the Earth's crust and mantle, using 36 compute nodes of Summit. The matrix-vector kernel achieved a 4.1-fold speedup over a standard single-precision kernel. Although the proposed solver increases the total FLOP count, it shortens the time-to-solution by a factor of 1.7 because Tensor Core delivers high effective performance.
