Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Progress in the accuracy of numerical weather and climate prediction depends greatly on growth in available computing power. As the number of cores in top computing facilities pushes into the millions, the increasing average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware-, application- and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, and in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed, and some recommendations are outlined for future developments.
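
As a purely illustrative sketch of in-memory checkpointing applied to an iterative linear solver, the following Python fragment keeps a periodic copy of the iterate in memory and rolls back to it when a corrupted state is detected. It is not taken from the report: the Jacobi iteration, the function name `jacobi_with_checkpoints`, the parameters `ckpt_every` and `fault_at`, the NaN-based fault injection and the finiteness check used as a detector are all assumptions chosen to keep the example small.

```python
import math


def jacobi_with_checkpoints(A, b, tol=1e-10, max_iter=500,
                            ckpt_every=5, fault_at=None):
    """Solve A x = b by Jacobi iteration with an in-memory checkpoint of x.

    fault_at is a hypothetical iteration index at which one entry of x is
    overwritten with NaN to mimic a soft fault (None disables injection).
    Returns (x, sweeps_executed, rollbacks).
    """
    n = len(b)
    x = [0.0] * n
    checkpoint = (0, list(x))   # (iteration index, copy of the iterate)
    sweeps = 0
    rollbacks = 0
    injected = False
    k = 0
    while k < max_iter:
        # Hypothetical fault injection: corrupt one entry exactly once.
        if fault_at is not None and k == fault_at and not injected:
            x[0] = float("nan")
            injected = True

        # One Jacobi sweep.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        sweeps += 1

        # Simplistic detector (an assumption): any non-finite value
        # triggers a rollback to the last in-memory checkpoint.
        if not all(math.isfinite(v) for v in x_new):
            k, saved = checkpoint
            x = list(saved)
            rollbacks += 1
            continue

        change = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        k += 1
        if change < tol:
            break
        # Refresh the checkpoint every ckpt_every iterations.
        if k % ckpt_every == 0:
            checkpoint = (k, list(x))
    return x, sweeps, rollbacks


if __name__ == "__main__":
    # Small diagonally dominant test system whose exact solution is [1, 1, 1].
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]
    print(jacobi_with_checkpoints(A, b))               # fault-free run
    print(jacobi_with_checkpoints(A, b, fault_at=7))   # run with one injected fault
```

The faulty run reaches the same solution as the fault-free one at the cost of the failed sweep plus the sweeps repeated since the last checkpoint. This is the basic trade-off between checkpoint frequency and recomputation that any checkpointing scheme has to balance.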
