Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) that relies on error detection techniques already available in commodity hardware. Detection operates at memory page granularity, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware can pinpoint errors precisely. Relations extracted directly from the solver make it possible to recover lost data exactly. The method avoids the overheads of backward recovery schemes such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and to their preconditioned versions. We implement our resilience techniques on CG at scales ranging from small (8 cores) to large (1024 cores), and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them onto the critical path of the application. The two approaches trade off against each other depending on the error rate the solver experiences. Under realistic error rates, overlapping decreases overheads from 5.37% to 3.59% for a non-preconditioned CG on 8 cores.
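
To illustrate what recovering lost data from a solver relation can look like, the following is a minimal sketch and not the paper's implementation. It assumes the standard CG invariant r = b - A x holds between the iterate x and the residual r, and that the hardware has reported which entries (standing in for one memory page) were lost. All function names are hypothetical, and the small dense solve is only a stand-in for whatever block solver a real code would use. Recovery is exact in exact arithmetic; in floating point it holds up to rounding.

```python
# Sketch: forward recovery of a lost block in CG using r = b - A x.
# Assumption: A and b are intact; `lost` lists the indices reported as corrupted.

def recover_residual_block(A, b, x, lost):
    """Recompute the lost entries of r from intact A, b and x."""
    n = len(x)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in lost]

def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol

def recover_iterate_block(A, b, r, x, lost):
    """Recover the lost entries of x from intact A, b, r and the rest of x.

    Rows `lost` of r = b - A x give the small system
        A[lost, lost] x[lost] = b[lost] - r[lost] - A[lost, kept] x[kept].
    The values stored in x at the lost positions are never read.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    block = [[A[i][j] for j in lost] for i in lost]
    return solve_dense(block, rhs)

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0, 0.0],
         [1.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 1.0],
         [0.0, 0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0, 4.0]
    x = [0.5, -1.0, 2.0, 0.25]
    r = [b[i] - sum(A[i][j] * x[j] for j in range(4)) for i in range(4)]

    damaged = x[:]
    damaged[1] = damaged[2] = float("nan")  # simulate a lost page
    print(recover_iterate_block(A, b, r, damaged, [1, 2]))
```

The point of the sketch is that only the rows touching the lost entries are needed, which is why page-level detection matters: the finer the reported error location, the smaller the block that has to be recomputed or solved for.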
