Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method

The literature shows that the Lanczos eigensolver is a popular iterative method for approximating a few of the largest eigenvalues of a real symmetric matrix, particularly when the matrix is large and sparse. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra, and they are increasingly used as the main computational units in supercomputers. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and exascale range. In this paper, building on our earlier work [22], we investigate in detail an error checking mechanism for the Lanczos eigensolver. We identify a low-cost invariant for efficient error checking and, through mathematical analysis, determine the efficiency of the mechanism when it is used by the Lanczos eigensolver. We evaluate the proposed fault-tolerant scheme using an open-source sparse eigensolver on a GPU platform, with and without fault injection, on a large number of sparse matrices from real applications. Our implementation shows that the proposed fault-tolerant method has good error coverage and low overhead. To the best of our knowledge, we are the first to introduce such a scheme for the Lanczos method.
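To make the idea of invariant checking concrete, the following is a minimal sketch and not the scheme evaluated in the paper. The abstract does not state which invariant is used, so the sketch assumes a textbook property of the Lanczos recurrence: in exact arithmetic, each new Lanczos vector is orthogonal to the two vectors before it. The function names, the tolerance `tol`, and the `fault_at` injection hook are all hypothetical and exist only for illustration. The code runs on a CPU with dense lists, not on a GPU.

```python
import math
import random


def laplacian_1d(n):
    """Small symmetric test matrix: tridiag(-1, 2, -1), stored densely."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def lanczos_checked(A, k, tol=1e-8, fault_at=None, seed=0):
    """Run up to k Lanczos steps on symmetric A with a per-step check.

    Assumed invariant (illustrative only): the new Lanczos vector is
    orthogonal to the two previous ones. Returns (alphas, betas, flagged),
    where flagged is the first step that violated the check, or None.
    """
    n = len(A)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    v_prev = [0.0] * n
    beta = 0.0
    alphas, betas = [], []
    for j in range(k):
        w = matvec(A, v)
        if fault_at == j:
            w[0] += 1.0  # hypothetical transient fault in the matvec output
        alpha = dot(w, v)
        # Three-term recurrence: w = A v_j - alpha_j v_j - beta_j v_{j-1}
        w = [wi - alpha * vi - beta * pi
             for wi, vi, pi in zip(w, v, v_prev)]
        beta_next = math.sqrt(dot(w, w))
        if beta_next < 1e-12:  # breakdown: Krylov subspace is invariant
            alphas.append(alpha)
            break
        v_next = [wi / beta_next for wi in w]
        # Invariant check: two extra dot products per iteration.
        if abs(dot(v_next, v)) > tol or abs(dot(v_next, v_prev)) > tol:
            return alphas, betas, j
        alphas.append(alpha)
        betas.append(beta_next)
        v_prev, v, beta = v, v_next, beta_next
    return alphas, betas, None


if __name__ == "__main__":
    A = laplacian_1d(12)
    print(lanczos_checked(A, 6)[2])              # fault-free run
    print(lanczos_checked(A, 6, fault_at=2)[2])  # run with injected fault
```

In this sketch the check against the current vector alone would miss the injected fault, because the recurrence subtracts that component by construction. The check against the previous vector is what exposes the corrupted matrix-vector product. It cannot do so at step 0, where the previous vector is zero.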

[1]  Bin Nie,et al.  A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[2]  William Gropp,et al.  Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries , 1997, SciTools.

[3]  Gene H. Golub,et al.  Matrix computations (3rd ed.) , 1996 .

[4]  Padma Raghavan,et al.  Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.

[5]  C. Lanczos An iteration method for the solution of the eigenvalue problem of linear differential and integral operators , 1950 .

[6]  Zizhong Chen,et al.  Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.

[7]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[8]  Claus Braun,et al.  A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[9]  Michael A. Heroux Software Challenges for Extreme Scale Computing: Going From Petascale to Exascale Systems , 2009, Int. J. High Perform. Comput. Appl..

[10]  Zizhong Chen,et al.  Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition , 2015, IEEE Transactions on Parallel and Distributed Systems.

[11]  Bronis R. de Supinski,et al.  Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.

[12]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[13]  Andrew V. Knyazev,et al.  Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method , 2001, SIAM J. Sci. Comput..

[14]  Gene H. Golub,et al.  Matrix computations , 1983 .

[15]  William Gropp,et al.  PETSc Users Manual Revision 3.4 , 2016 .

[16]  Vicente Hernández,et al.  SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems , 2005, TOMS.

[17]  Hyesoon Kim,et al.  Performance Analysis and Tuning for General Purpose Graphics Processing Units , 2012 .

[18]  Rakesh Kumar,et al.  Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[19]  Parameswaran Ramanathan,et al.  Fault Tolerance through Invariant Checking for the Lanczos Eigensolver , 2020, 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID).

[20]  Frank Mueller,et al.  Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[21]  Jyothi Velamala,et al.  Soft Error Rate Improvements in 14-nm Technology Featuring Second-Generation 3D Tri-Gate Transistors , 2015, IEEE Transactions on Nuclear Science.

[22]  Parameswaran Ramanathan,et al.  Transient Fault Resilient QR Factorization on GPUs , 2015, FTXS@HPDC.

[23]  Tilak Agerwala Exascale computing: The challenges and opportunities in the next decade , 2010, HPCA.

[24]  Barry F. Smith,et al.  PETSc Users Manual , 2019 .

[25]  Claus Braun,et al.  Low-overhead fault-tolerance for the preconditioned conjugate gradient solver , 2015, 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS).

[26]  AngryCalc NVIDIA GeForce GTX 1080 , 2018 .

[27]  W. Arnoldi The principle of minimized iterations in the solution of the matrix eigenvalue problem , 1951 .

[28]  Zizhong Chen,et al.  Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[29]  Dingwen Tao,et al.  Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra , 2016, HPDC.

[30]  Parameswaran Ramanathan,et al.  Fault Tolerance through Invariant Checking for Iterative Solvers , 2016, 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID).

[31]  Mehdi Baradaran Tahoori,et al.  Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers , 2011, 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing.