Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
Peter D. Düben | Dominik Göddeke | Luc Giraud | Chris D. Cantwell | Erwan Raffin | Mike Gillard | Keita Teranishi | Nils Wedi | Luca Bonaventura | Mirco Altenbernd | Tommaso Benacchio
[1] David E. Bernholdt,et al. High Performance Computing Facility Operational Assessment, 2012 Oak Ridge Leadership Computing Facility , 2012 .
[2] Franck Cappello,et al. Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[3] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[4] Emmanuel Agullo,et al. On the Resilience of Parallel Sparse Hybrid Solvers , 2015, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC).
[5] Franck Cappello,et al. Detecting Silent Data Corruption for Extreme-Scale MPI Applications , 2015, EuroMPI.
[6] Franck Cappello,et al. Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets , 2018, 2018 IEEE International Conference on Big Data (Big Data).
[7] Franck Cappello,et al. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[8] Ian Briggs,et al. Multi-Level Analysis of Compiler-Induced Variability and Performance Tradeoffs , 2019, HPDC.
[9] Nils Wedi,et al. Assessing the scales in numerical weather and climate predictions: will exascale be the rescue? , 2019, Philosophical Transactions of the Royal Society A.
[10] Rupert Klein,et al. A Blended Soundproof-to-Compressible Numerical Model for Small- to Mesoscale Atmospheric Dynamics , 2014 .
[11] W. W. Peterson,et al. Cyclic Codes for Error Detection , 1961, Proceedings of the IRE.
[12] Meeta Sharma Gupta,et al. Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] Franck Cappello,et al. Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications , 2016, Euro-Par.
[14] Rushil Anirudh,et al. The Case of Performance Variability on Dragonfly-based Systems , 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[15] Christian Engelmann,et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] J. Szmelter,et al. A nonhydrostatic unstructured-mesh soundproof model for simulation of internal gravity waves , 2011 .
[17] Luca Bonaventura,et al. Earth system modelling 2: Algorithms, code infrastructure and optimization , 2012 .
[18] Christian Engelmann,et al. Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery , 2018, 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
[19] Philip W. Jones,et al. The DOE E3SM Coupled Model Version 1: Overview and Evaluation at Standard Resolution , 2019, Journal of Advances in Modeling Earth Systems.
[20] Jeffrey S. Vetter,et al. A Survey of Techniques for Modeling and Improving Reliability of Computing Systems , 2016, IEEE Transactions on Parallel and Distributed Systems.
[21] Emmanuel Agullo,et al. Numerical recovery strategies for parallel resilient Krylov linear solvers , 2016, Numer. Linear Algebra Appl..
[22] Takemasa Miyoshi,et al. Choosing the Optimal Numerical Precision for Data Assimilation in the Presence of Model Error , 2018, Journal of Advances in Modeling Earth Systems.
[23] Giovanni Tumolo,et al. A semi‐implicit, semi‐Lagrangian discontinuous Galerkin framework for adaptive numerical weather prediction , 2015 .
[24] Tobias Gysi,et al. Towards a performance portable, architecture agnostic implementation strategy for weather and climate models , 2014, Supercomput. Front. Innov..
[25] Marwa F. Mohamed. Service replication taxonomy in distributed environments , 2015, Service Oriented Computing and Applications.
[26] Franck Cappello,et al. Detection of Silent Data Corruption in Adaptive Numerical Integration Solvers , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[27] Peter Bauer,et al. The quiet revolution of numerical weather prediction , 2015, Nature.
[28] Emmanuel Agullo,et al. Towards resilient parallel linear Krylov solvers: recover-restart strategies , 2013 .
[29] Francis X. Giraldo,et al. Current and Emerging Time-Integration Strategies in Global Numerical Weather and Climate Prediction , 2019 .
[30] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability. , 2011 .
[31] Peter D. Düben,et al. The use of imprecise processing to improve accuracy in weather & climate prediction , 2014, J. Comput. Phys..
[32] Frank Mueller,et al. A Numerical Soft Fault Model for Iterative Linear Solvers , 2015, HPDC.
[33] Ulrich Rüde,et al. A scalable and extensible checkpointing scheme for massively parallel simulations , 2019, Int. J. High Perform. Comput. Appl..
[34] P. Bauer,et al. A Baseline for Global Weather and Climate Simulations at 1 km Resolution , 2020, Journal of Advances in Modeling Earth Systems.
[35] Franck Cappello,et al. Fast Error-Bounded Lossy HPC Data Compression with SZ , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[36] Jeffrey S. Vetter,et al. A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems , 2016 .
[37] T. Benacchio,et al. A blended semi-implicit numerical model for weakly compressible atmospheric dynamics , 2014 .
[38] Timothy J. Dell,et al. A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .
[39] George Bosilca,et al. Local rollback for resilient MPI applications with application-level checkpointing and message logging , 2019, Future Gener. Comput. Syst..
[40] Scott Klasky,et al. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[41] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[42] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[43] Joanna Szmelter,et al. FVM 1.0: a nonhydrostatic finite-volume dynamical core for the IFS , 2019, Geoscientific Model Development.
[44] Daniel S. Katz,et al. Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques , 2016, 2016 45th International Conference on Parallel Processing Workshops (ICPPW).
[45] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[46] Torsten Hoefler,et al. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0 , 2017 .
[47] Bronis R. de Supinski,et al. MCREngine: A scalable checkpointing system using data-aware aggregation and compression , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[48] Fabrizio Petrini,et al. On the feasibility of incremental checkpointing for scientific computing , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[49] Robert B. Ross,et al. Watch Out for the Bully! Job Interference Study on Dragonfly Network , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[50] Tenkasi V. Ramabadran,et al. A tutorial on CRC computations , 1988, IEEE Micro.
[51] Ignacio Laguna,et al. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance , 2020, ISC.
[52] Allan S. Nielsen,et al. Scaling and Resilience in Numerical Algorithms for Exascale Computing , 2018 .
[53] N. Wedi,et al. Global simulations of the atmosphere at 1.45 km grid-spacing with the Integrated Forecasting System , 2020, Journal of the Meteorological Society of Japan. Ser. II.
[54] Gabriel Rodríguez,et al. CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications , 2010, Concurr. Comput. Pract. Exp..
[55] Emmanuel Agullo,et al. Interpolation-Restart Strategies for Resilient Eigensolvers , 2016, SIAM J. Sci. Comput..
[56] Failure Prediction in Hardware Systems , .
[57] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[58] Christian Engelmann,et al. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale , 2016, Supercomput. Front. Innov..
[59] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[60] Frank Mueller,et al. Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[61] J. Powis,et al. [The quiet revolution]. , 1982, Josanpu zasshi = The Japanese journal for midwife.
[62] Christopher J. Roy,et al. Review of code and solution verification procedures for computational simulation , 2005 .
[63] Martin Schulz,et al. Evaluating User-Level Fault Tolerance for MPI Applications , 2014, EuroMPI/ASIA.
[64] Canqun Yang,et al. Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization , 2017, The Journal of Supercomputing.
[65] M. Saunders,et al. Solution of Sparse Indefinite Systems of Linear Equations , 1975 .
[66] Dominik Göddeke,et al. A High-Level C++ Approach to Manage Local Errors, Asynchrony and Faults in an MPI Application , 2018, 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
[67] Torsten Hoefler,et al. Reflecting on the Goal and Baseline for Exascale Computing: A Roadmap Based on Weather and Climate Simulations , 2019, Computing in Science & Engineering.
[68] Osman S. Unsal,et al. Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[69] Robert B. Ross,et al. Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[70] Cosmin Safta,et al. Fault Resilient Domain Decomposition Preconditioner for PDEs , 2015, SIAM J. Sci. Comput..
[71] Andrew Dawson,et al. An approach to secure weather and climate models against hardware faults , 2017 .
[72] Emmanuel Agullo,et al. Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers , 2016, VECPAR.
[73] George Bosilca,et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment , 2007, SIAM J. Sci. Comput..
[74] Noah Evans,et al. Verifying Qthreads: Is Model Checking Viable for User Level Tasking Runtimes? , 2018, 2018 IEEE/ACM 2nd International Workshop on Software Correctness for HPC Applications (Correctness).
[75] S. E. Michalak,et al. Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer , 2012, IEEE Transactions on Device and Materials Reliability.
[76] Gianmarco Mengaldo,et al. Batch 1: Definition of several Weather & Climate Dwarfs , 2019, ArXiv.
[77] Sparsh Mittal,et al. A survey of techniques for improving error-resilience of DRAM , 2018, J. Syst. Archit..
[78] Torsten Hoefler,et al. Porting the COSMO Weather Model to Manycore CPUs , 2019, PASC.
[79] Kurt B. Ferreira,et al. Fault-tolerant linear solvers via selective reliability , 2012, ArXiv.
[80] Akihiro Hayashi,et al. Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System , 2020, 2020 Workshop on Exascale MPI (ExaMPI).
[81] Mattan Erez,et al. Frugal ECC: efficient and versatile memory error protection through fine-grained compression , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[82] Cosmin Safta,et al. Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner , 2016, Parallel Comput..
[83] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[84] Rolf Riesen,et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[85] Andrew A. Chien,et al. Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience , 2015, ICCS.
[86] Valérie Frayssé,et al. Inexact Matrix-Vector Products in Krylov Methods for Solving Linear Systems: A Relaxation Strategy , 2005, SIAM J. Matrix Anal. Appl..
[87] Peter Bastian,et al. The Iterative Solver Template Library , 2006, PARA.
[88] Emmanuel Agullo,et al. Block GMRES Method with Inexact Breakdowns and Deflated Restarting , 2014, SIAM J. Matrix Anal. Appl..
[89] Daniel Thiemert,et al. The ESCAPE project: Energy-efficient Scalable Algorithms for Weather Prediction at Exascale , 2019 .
[90] Luca Bonaventura,et al. Review of numerical methods for nonhydrostatic weather prediction models , 2003 .
[91] Michael A. Heroux,et al. Fenix, A Fault Tolerant Programming Framework for MPI Applications , 2016 .
[92] Jeffrey S. Vetter,et al. A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems , 2016, IEEE Transactions on Parallel and Distributed Systems.
[93] Christopher J. Roy,et al. A comprehensive framework for verification, validation, and uncertainty quantification in scientific computing , 2011 .
[94] Karthik Pattabiraman,et al. Modeling Soft-Error Propagation in Programs , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[95] A. A. White,et al. Large-Scale Atmosphere–Ocean Dynamics: A view of the equations of meteorological dynamics and various approximations , 2002 .
[96] Andrew M. Bradley,et al. HOMMEXX 1.0: a performance-portable atmospheric dynamical core for the Energy Exascale Earth System Model , 2019, Geoscientific Model Development.
[97] Anthony Skjellum,et al. Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[98] Carl E. Landwehr,et al. Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.
[99] Luca Bonaventura,et al. Earth System Modelling - Volume 2: Algorithms, Code Infrastructure and Optimisation , 2011 .
[100] Thomas Hérault,et al. Algorithm-based fault tolerance for dense matrix factorizations , 2012, PPoPP '12.
[101] Roberto R. Osorio,et al. Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes , 2013, New Generation Computing.
[102] Petar Radojkovic. Towards resilient EU HPC systems: a blueprint , 2019, CF.
[103] Peter D. Düben,et al. Benchmark Tests for Numerical Weather Forecasts on Inexact Hardware , 2014 .
[104] Wayne Luk,et al. Architectures and Precision Analysis for Modelling Atmospheric Variables with Chaotic Behaviour , 2015, 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines.
[105] Thomas Hérault,et al. An evaluation of User-Level Failure Mitigation support in MPI , 2012, Computing.
[106] Thomas Hérault,et al. Post-failure recovery of MPI communication capability , 2013, Int. J. High Perform. Comput. Appl..
[107] Franck Cappello,et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..
[108] Hans Werner Meuer,et al. Top500 Supercomputer Sites , 1997 .
[109] Mattan Erez,et al. Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[110] James Demmel,et al. Parallel Reproducible Summation , 2015, IEEE Transactions on Computers.
[111] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[112] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[113] Katherine E. Isaacs,et al. There goes the neighborhood: Performance degradation due to nearby jobs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[114] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[115] P. K. Smolarkiewicz,et al. VARIATIONAL METHODS FOR ELLIPTIC PROBLEMS IN FLUID MODELS , 2000 .
[116] Ravishankar K. Iyer,et al. Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[117] Joanna Szmelter,et al. FVM 1.0: A nonhydrostatic finite-volume dynamical core formulation for IFS , 2018 .
[118] Peter D. Düben,et al. On the use of inexact, pruned hardware in atmospheric modelling , 2014, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.
[119] William Gropp,et al. Towards a More Complete Understanding of SDC Propagation , 2017, HPDC.
[120] J. Szmelter,et al. MPDATA: An edge-based unstructured-grid formulation , 2005 .
[121] Franck Cappello,et al. Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale , 2017, FTXS '17.
[122] Guangyu Sun,et al. Exploring Memory Hierarchy Design with Emerging Memory Technologies , 2013, Lecture Notes in Electrical Engineering.
[123] Manish Parashar,et al. Specification of Fenix MPI Fault Tolerance library version 0.9. , 2016 .
[124] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[125] Luigi Carro,et al. Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[126] Song Fu,et al. Characterizing and Modeling Reliability of Declustered RAID for HPC Storage Systems , 2019, 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Industry Track.
[127] Dejan S. Milojicic,et al. Optimizing Checkpoints Using NVM as Virtual Memory , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[128] Amin Ansari,et al. Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.
[129] Franck Cappello,et al. Improving performance of iterative methods by lossy checkponting , 2018, HPDC.
[130] Michela Taufer,et al. On the Need for Reproducible Numerical Accuracy through Intelligent Runtime Selection of Reduction Algorithms at the Extreme Scale , 2015, 2015 IEEE International Conference on Cluster Computing.
[131] Chris D. Cantwell,et al. A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers , 2018, J. Sci. Comput..
[132] Bernd Mohr,et al. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis , 2017 .
[133] Manish Parashar,et al. Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping , 2017, SIAM J. Sci. Comput..
[134] Eike Hermann Müller,et al. LFRic: Meeting the challenges of scalability and performance portability in Weather and Climate models , 2018, J. Parallel Distributed Comput..
[135] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[136] Martin Schulz,et al. Evaluating and extending user-level fault tolerance in MPI applications , 2016, Int. J. High Perform. Comput. Appl..
[137] Micah Beck,et al. Compiler-Assisted Memory Exclusion for Fast Checkpointing , 1995 .
[138] Manish Parashar,et al. Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales , 2017, IEEE Transactions on Parallel and Distributed Systems.
[139] Rachid Guerraoui,et al. Software-Based Replication for Fault Tolerance , 1997, Computer.
[140] Peter D. Düben,et al. Rounding errors may be beneficial for simulations of atmospheric flow: results from the forced 1D Burgers equation , 2015 .
[141] Yuichi Inadomi,et al. Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[142] Peter D. Düben,et al. Reliable low precision simulations in land surface models , 2018, Climate Dynamics.
[143] Y. Saad,et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .
[144] George Bosilca,et al. Fault tolerance of MPI applications in exascale systems: The ULFM solution , 2020, Future Gener. Comput. Syst..
[145] Barbara I. Wohlmuth,et al. Resilience for Massively Parallel Multigrid Solvers , 2016, SIAM J. Sci. Comput..
[146] Richard M. Karp,et al. Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling , 2012, CF '12.
[147] Peter Bastian,et al. Generic implementation of finite element methods in the Distributed and Unified Numerics Environment (DUNE) , 2010, Kybernetika.
[148] Md. Mohsin Ali,et al. Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.
[149] John Sartori,et al. Stochastic computing: Embracing errors in architecture and design of processors and applications , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).
[150] 高等学校計算数学学報編輯委員会編,et al. 高等学校計算数学学報 = Numerical mathematics , 1979 .
[151] Dong Li,et al. Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[152] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[153] Martin Leutbecher,et al. On the probabilistic skill of dual‐resolution ensemble forecasts , 2019, Quarterly Journal of the Royal Meteorological Society.
[154] Dominik Göddeke,et al. Soft fault detection and correction for multigrid , 2018, Int. J. High Perform. Comput. Appl..
[155] Andreas Dedner,et al. The Distributed and Unified Numerics Environment, Version 2.4 , 2016 .
[156] Michael A. Heroux,et al. Toward Local Failure Local Recovery Resilience Model using MPI-ULFM , 2014, EuroMPI/ASIA.
[157] William Gropp,et al. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , 2013, HiPC 2013.
[158] Dirk Ribbrock,et al. Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing , 2015, Parallel Comput..
[159] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.