Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs

Extensive research has been done on developing and optimizing algorithm-based fault tolerance (ABFT) schemes for systolic arrays and general-purpose microprocessors. However, little has been done on developing and optimizing ABFT schemes for heterogeneous systems with GPU accelerators. While existing ABFT schemes can correct computing errors such as 1+1=3, we find that many memory storage errors cannot be corrected by existing ABFT schemes. In this paper, we first develop a new ABFT scheme for Cholesky decomposition that can correct both computing errors and storage errors at the same time, and then develop several optimization techniques to reduce the fault tolerance overhead of ABFT for heterogeneous systems with GPU accelerators. Experimental results demonstrate that our fault-tolerant Cholesky decomposition is able to correct both computing errors and storage errors in the middle of the computation and can achieve better performance than the state-of-the-art vendor-provided Cholesky decomposition routine in CULA R18.
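
To make the checksum idea behind ABFT concrete, the following is a minimal sketch of the classic approach of protecting a matrix with a plain and a weighted column checksum, which lets a single corrupted element per column be located and repaired. It is not the scheme developed in this paper, and it ignores Cholesky updates and GPUs entirely; the function names `column_checksums` and `detect_and_correct` and the tolerance `tol` are assumptions made for illustration.

```python
def column_checksums(a):
    """Return (plain, weighted) column checksums of a row-major matrix.

    plain[j]    = sum_i a[i][j]
    weighted[j] = sum_i (i + 1) * a[i][j]
    """
    n_cols = len(a[0])
    plain = [sum(row[j] for row in a) for j in range(n_cols)]
    weighted = [
        sum((i + 1) * row[j] for i, row in enumerate(a)) for j in range(n_cols)
    ]
    return plain, weighted


def detect_and_correct(a, plain, weighted, tol=1e-9):
    """Repair at most one corrupted element per column, in place.

    Returns the list of (row, col) positions that were corrected.
    Assumption: tol is a fixed illustrative threshold; real floating-point
    code would need a threshold scaled to the data and rounding error.
    """
    fixed = []
    new_plain, new_weighted = column_checksums(a)
    for j in range(len(a[0])):
        d1 = new_plain[j] - plain[j]        # equals the error magnitude
        if abs(d1) <= tol:
            continue                        # column is consistent
        d2 = new_weighted[j] - weighted[j]  # equals (row + 1) * error
        row = round(d2 / d1) - 1            # recover the faulty row index
        if not 0 <= row < len(a):
            raise ValueError("inconsistent checksums; cannot locate one error")
        a[row][j] -= d1                     # subtract the error
        fixed.append((row, j))
    return fixed


if __name__ == "__main__":
    a = [[4.0, 2.0, 1.0], [2.0, 5.0, 3.0], [1.0, 3.0, 6.0]]
    plain, weighted = column_checksums(a)
    a[1][2] += 7.0  # simulate a storage error in one element
    print(detect_and_correct(a, plain, weighted), a[1][2])
```

The two checksums play different roles: the plain sum reveals how large the error is, and the ratio of the weighted discrepancy to the plain discrepancy reveals which row it sits in. Two errors in the same column defeat this simple version.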
