Proactive Fault Tolerance in Large Systems
Laxmikant V. Kalé | Sayantan Chakravorty | Celso L. Mendes
[1] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[2] William Gropp,et al. Fault Tolerance in Message Passing Interface Programs , 2004, Int. J. High Perform. Comput. Appl..
[3] Laxmikant V. Kalé,et al. Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study , 2003, International Conference on Computational Science.
[4] Bruce Allen,et al. Monitoring Hard Disks with SMART , 2004 .
[5] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[6] Charles L. Seitz,et al. Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.
[7] Robert E. Strom,et al. Optimistic recovery in distributed systems , 1985, TOCS.
[8] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 2004, Cluster Computing.
[9] Anthony Skjellum,et al. MPI/FT™: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.
[10] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing.
[11] Adrianos Lachanas,et al. MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..
[12] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing.
[13] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[14] Chao Huang. System Support for Checkpoint and Restart of Charm++ and AMPI Applications , 2004 .
[15] Kai Li,et al. CLIP: A Checkpointing Tool for Message Passing Parallel Programs , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[16] Laxmikant V. Kalé,et al. Supporting dynamic parallel object arrays , 2003, Concurr. Comput. Pract. Exp..
[17] Jack J. Dongarra,et al. Building and Using a Fault-Tolerant MPI Implementation , 2004, Int. J. High Perform. Comput. Appl..
[18] Laxmikant V. Kalé,et al. Performance and Productivity in Parallel Programming via Processor Virtualization , 2004 .