Fault tolerance of MPI applications in exascale systems: The ULFM solution
George Bosilca | Nuria Losada | Patricia González | María J. Martín | Keita Teranishi | Aurélien Bouteiller
[1] Gerhard Wellein,et al. CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance , 2017, IEEE Transactions on Parallel and Distributed Systems.
[2] Nuria Losada,et al. Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications , 2016, The Journal of Supercomputing.
[3] Markus Hegland,et al. Complex scientific applications made fault-tolerant with the sparse grid combination technique , 2016, Int. J. High Perform. Comput. Appl..
[4] George Bosilca,et al. Local rollback for resilient MPI applications with application-level checkpointing and message logging , 2019, Future Gener. Comput. Syst..
[5] Vivek Sarkar,et al. X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.
[6] Peter Arbenz,et al. Intrinsic fault tolerance of multilevel Monte Carlo methods , 2015, J. Parallel Distributed Comput..
[7] Ravishankar K. Iyer,et al. Measuring the Resiliency of Extreme-Scale Computing Environments , 2016 .
[8] Jack Dongarra,et al. Redesigning the message logging model for high performance , 2010, ISC 2010.
[9] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[10] Bert J. Debusschere,et al. Application Fault Tolerance for Shrinking Resources via the Sparse Grid Combination Technique , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[11] Martin Schulz,et al. Evaluating and extending user-level fault tolerance in MPI applications , 2016, Int. J. High Perform. Comput. Appl..
[12] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[13] Suo Guang,et al. NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems , 2016, Supercomput. Front. Innov..
[14] Martin Schulz,et al. Evaluating User-Level Fault Tolerance for MPI Applications , 2014, EuroMPI/ASIA.
[15] Cosmin Safta,et al. ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner , 2016, FTXS@HPDC.
[16] Thomas Hérault,et al. Post-failure recovery of MPI communication capability , 2013, Int. J. High Perform. Comput. Appl..
[17] Thomas Hérault,et al. MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI , 2006, Int. J. High Perform. Comput. Appl..
[18] Christian Engelmann,et al. Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery , 2018, 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).
[19] Martin Schulz,et al. A Global Exception Fault Tolerance Model for MPI , 2014 .
[20] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[21] Michael A. Heroux,et al. Toward Local Failure Local Recovery Resilience Model using MPI-ULFM , 2014, EuroMPI/ASIA.
[22] Dinshaw S. Balsara,et al. Resilient computational applications using Coarray Fortran , 2019, Parallel Comput..
[23] George Bosilca,et al. Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications , 2019, 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
[24] Chris D. Cantwell,et al. A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers , 2018, J. Sci. Comput..
[25] Dhabaleswar K. Panda,et al. EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications , 2018, Concurr. Comput. Pract. Exp..
[26] Satoshi Matsuoka,et al. FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[27] Carl E. Landwehr,et al. Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.
[28] Xiangke Liao,et al. NR-MPI: A Non-stop and Fault Resilient MPI , 2013, ICPADS 2013.
[29] Thomas Hérault,et al. A failure detector for HPC platforms , 2018, Int. J. High Perform. Comput. Appl..
[30] Manish Parashar,et al. Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales , 2017, IEEE Transactions on Parallel and Distributed Systems.
[31] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[32] Alan D. George,et al. FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems , 2006, ESA.
[33] Robert W. Numrich,et al. Co-array Fortran for parallel programming , 1998, FORF.
[34] Peter Arbenz,et al. A fault tolerant implementation of Multi-Level Monte Carlo methods , 2013, ParCo 2013.
[35] Pavan Balaji,et al. Simplifying the Recovery Model of User-Level Failure Mitigation , 2014, 2014 Workshop on Exascale MPI at Supercomputing Conference.
[36] Adrianos Lachanas,et al. MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..
[37] Thomas Hérault,et al. Practical scalable consensus for pseudo-synchronous distributed systems , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[38] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[39] Anthony Skjellum,et al. Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[40] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[41] Srikumar Venugopal,et al. Architecting Malleable MPI Applications for Priority-driven Adaptive Scheduling , 2016, EuroMPI.
[42] Aurelien Bouteiller,et al. PMIx: Process management for exascale environments , 2018, Parallel Comput..