Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI