FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
Satoshi Matsuoka | Bronis R. de Supinski | Todd Gamblin | Adam Moody | Kento Sato | Kathryn Mohror | Naoya Maruyama
[1] Thomas Hérault,et al. An evaluation of User-Level Failure Mitigation support in MPI , 2012, Computing.
[2] David R. Karger,et al. Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.
[3] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[4] Chin-Long Chen,et al. Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..
[5] Randy H. Katz,et al. A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.
[6] Brian W. Barrett,et al. An Evaluation of Open MPI's Matching Transport Layer on the Cray XT , 2007, PVM/MPI.
[7] Franck Cappello,et al. Distributed Diskless Checkpoint for Large Scale Systems , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[8] Kamil Iskra,et al. ZOID: I/O-forwarding infrastructure for petascale architectures , 2008, PPoPP.
[9] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[10] Dhabaleswar K. Panda,et al. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand , 2006, 2006 International Conference on Parallel Processing (ICPP'06).
[11] Bianca Schroeder,et al. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? , 2007, FAST.
[12] Bronis R. de Supinski,et al. Design and modeling of non-blocking checkpoint system , 2012, HiPC 2012.
[13] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[14] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[15] Laxmikant V. Kalé,et al. Adaptive MPI , 2003, LCPC.
[16] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[17] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .
[18] Nitin H. Vaidya,et al. On Checkpoint Latency , 1995 .
[19] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[20] D. Quinlan,et al. Inter-Agency Workshop on HPC Resilience at Extreme Scale, National Security Agency Advanced Computing Systems, February 21–24, 2012 , 2012 .
[21] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[22] Satoshi Matsuoka,et al. Design and modeling of a non-blocking checkpointing system , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).