Resilience for Stencil Computations with Latent Errors

Projections and measurements of error rates in near-exascale and exascale systems suggest dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such growth makes resilience a critical concern and may increase the incidence of errors that "escape", silently corrupting application state. Such errors can often be revealed by application software tests, but only with long latencies, and are therefore known as latent errors. We explore how to recover efficiently from latent errors with an approach called application-based focused recovery (ABFR). Specifically, we present a case study of stencil computations, a widely useful computational structure, showing how ABFR focuses recovery effort where it is needed, uses intelligent testing and pruning to reduce that effort, and allows recovery to be overlapped with application computation. We analyze and characterize the ABFR approach on stencils, creating a performance model parameterized by error rate and detection interval (latency). We compare projections from the model with experimental results from the Chombo stencil application, validating the model and showing that ABFR on stencils can achieve significant reductions in error recovery cost (up to 400x) and recovery latency (up to 4x). Such reductions enable efficient execution at scale with high latent error rates.
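
The idea of focusing recovery on the part of a stencil domain that a latent error could have touched can be illustrated with a toy sketch. This is not the authors' implementation and does not use Chombo. It assumes a 1D 3-point averaging stencil with fixed end points, a clean checkpoint taken `latency` steps before detection, and an application test that reports the index where corruption was observed. The names `step`, `focused_recover`, `detect_idx`, and `latency` are hypothetical and introduced only for illustration.

```python
def step(u):
    """One 3-point averaging sweep; the two end points are held fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def focused_recover(checkpoint, current, detect_idx, latency):
    """Repair `current` from a clean `checkpoint` taken `latency` steps earlier.

    Returns (repaired_state, cells_recomputed_per_sweep).
    """
    n = len(checkpoint)
    # An error seen at detect_idx within `latency` steps started at most
    # `latency` cells away and has spread at most `latency` cells further.
    lo = max(0, detect_idx - 2 * latency)
    hi = min(n - 1, detect_idx + 2 * latency)
    # Recomputing [lo, hi] for `latency` sweeps needs a halo of `latency` cells.
    wlo = max(0, lo - latency)
    whi = min(n - 1, hi + latency)
    window = checkpoint[wlo:whi + 1]
    for _ in range(latency):
        window = step(window)  # halo cells go stale; [lo, hi] stays exact
    repaired = list(current)
    repaired[lo:hi + 1] = window[lo - wlo:hi - wlo + 1]
    return repaired, whi - wlo + 1


# Hypothetical scenario: 64 cells, error injected 2 sweeps before detection.
n, latency = 64, 4
state = [float((i * i) % 7) for i in range(n)]
for _ in range(10):
    state = step(state)
checkpoint = list(state)

reference = list(checkpoint)          # error-free run, for comparison only
for _ in range(latency):
    reference = step(reference)

faulty = list(checkpoint)
for t in range(1, latency + 1):
    faulty = step(faulty)
    if t == 2:
        faulty[30] += 1.0             # injected silent corruption

repaired, width = focused_recover(checkpoint, faulty, detect_idx=31,
                                  latency=latency)
```

In this scenario the recovery recomputes a window of `width` cells per sweep rather than all `n` cells, and the cells outside the patched range are left untouched because the stencil's dependence structure guarantees the error cannot have reached them. A full rollback would instead recompute every cell for every sweep since the checkpoint.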
