Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167-176] has shown how to combine such a verification mechanism (a stability test that checks the orthogonality of two vectors and recomputes the residual) with checkpointing: the idea is to verify every d iterations and to checkpoint every c × d iterations. When the verification mechanism detects a silent error, one can roll back to the last checkpoint and re-execute from there. In this paper, we also propose to combine checkpointing and verification, but we use ABFT (algorithm-based fault tolerance) rather than stability tests. ABFT can be used for error detection alone, or for both detection and correction; the latter allows forward recovery (with neither rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the Conjugate Gradient algorithm. Finally, we validate our new approach through a set of simulations.
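
To make the distinction between detection and forward recovery concrete, the following is a minimal sketch of a checksum-protected matrix-vector product, the kernel that dominates a Conjugate Gradient iteration. It is not the scheme evaluated in the paper: it assumes a dense matrix stored as Python lists, assumes that at most one entry of the output vector is corrupted (the matrix and the input vector are taken to be error-free), and uses a classic pair of checksums, one plain and one weighted by position. All function names and the tolerance values are hypothetical choices made for illustration.

```python
def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def column_checksums(A):
    """Precompute 1^T A and w^T A, with weights w = (1, 2, ..., n)."""
    n_cols = len(A[0])
    plain = [sum(row[j] for row in A) for j in range(n_cols)]
    weighted = [sum((i + 1) * row[j] for i, row in enumerate(A))
                for j in range(n_cols)]
    return plain, weighted


def verify_and_correct(y, x, plain, weighted, tol=1e-9):
    """Check y against both checksums and repair one bad entry in place.

    Returns "ok" if both checksums match, "corrected" if a single
    corrupted entry was located and fixed (forward recovery), and
    "rollback" if an error is detected that does not look like a single
    corrupted entry (backward recovery would be needed).
    """
    # d1 is the total error in y; d2 is the position-weighted error.
    d1 = sum(y) - sum(c * xj for c, xj in zip(plain, x))
    d2 = (sum((i + 1) * yi for i, yi in enumerate(y))
          - sum(c * xj for c, xj in zip(weighted, x)))
    if abs(d1) <= tol and abs(d2) <= tol:
        return "ok"
    if abs(d1) <= tol:
        return "rollback"
    # For a single error e at 1-based position p: d1 = e and d2 = p * e.
    ratio = d2 / d1
    pos = round(ratio)
    if abs(ratio - pos) > 1e-6 or not 1 <= pos <= len(y):
        return "rollback"
    y[pos - 1] -= d1
    return "corrected"


# Usage: corrupt one entry of y, then detect and correct it.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]
plain, weighted = column_checksums(A)
y = matvec(A, x)
y[1] += 0.5  # simulated silent error
status = verify_and_correct(y, x, plain, weighted)
print(status, y)
```

With the plain checksum alone, the sketch could only report that something is wrong, which forces a rollback to the last checkpoint. The second, weighted checksum is what locates the faulty entry and makes forward recovery possible. Under the single-error assumption the repair is exact up to rounding; several simultaneous errors can mimic a single one, so a real implementation would need a more careful treatment of that case and of floating-point tolerances.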
