MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes
Heon Young Yeom | Taesoon Park | Hyungsoo Jung | Namyoon Woo | Hyungwoo Park
[1] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[2] Richard Koo,et al. Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.
[3] Ewing L. Lusk,et al. Monitors, Messages, and Clusters: The p4 Parallel Programming System , 1994, Parallel Comput..
[4] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard , 1994 .
[5] Message Passing Interface Forum. MPI: A Message-Passing Interface , 1994 .
[6] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard , 1994 .
[7] Jian Xu,et al. Necessary and Sufficient Conditions for Consistent Global Snapshots , 1995, IEEE Trans. Parallel Distributed Syst..
[8] Kai Li,et al. Libckpt: Transparent Checkpointing under Unix , 1995 .
[9] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[10] Anthony Skjellum,et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..
[11] Jyh-Jong Tsay,et al. Checkpointing Message-Passing Interface (MPI) parallel programs , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.
[12] Erik Seligman,et al. Application Level Fault Tolerance in Heterogeneous Networks of Workstations , 1997, J. Parallel Distributed Comput..
[13] Kai Li,et al. CLIP: A Checkpointing Tool for Message Passing Parallel Programs , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[14] Ian T. Foster,et al. A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[15] Nuno Neves,et al. RENEW: a tool for fast and efficient implementation of checkpoint protocols , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing.
[16] L. Alvisi,et al. Message Logging: Pessimistic, Optimistic, Causal, and Optimal , 1998, IEEE Trans. Software Eng..
[17] Luís Moura Silva,et al. System-level versus user-defined checkpointing , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems.
[18] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[19] Jonathan Robinson,et al. The Hector Distributed Run-Time Environment , 1998, IEEE Trans. Parallel Distributed Syst..
[20] Ian T. Foster,et al. The Globus project: a status report , 1998, Proceedings Seventh Heterogeneous Computing Workshop (HCW'98).
[21] Sy-Yen Kuo,et al. Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability , 1998, IEEE Trans. Parallel Distributed Syst..
[22] Michael Litzkow,et al. Supporting checkpointing and process migration outside the UNIX kernel , 1999 .
[23] Lorenzo Alvisi,et al. An analysis of communication induced checkpointing , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing.
[24] Wei-Jih Li,et al. Checkpointing message passing interface (MPI) parallel programs , 1999 .
[25] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing.
[26] Miron Livny,et al. Process hijacking , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing.
[27] Harrick M. Vin,et al. Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing.
[28] Andrew S. Grimshaw,et al. Integrating fault-tolerance techniques in grid applications , 2000 .
[29] Adrianos Lachanas,et al. MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..
[30] William G. Tuel,et al. Parallel checkpoint/restart without message logging , 2000, Proceedings 2000. International Workshop on Parallel Processing.
[31] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[32] Anthony Skjellum,et al. MPI/FT™: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.
[33] Ian T. Foster,et al. The Anatomy of the Grid: Enabling Scalable Virtual Organizations , 2001, Int. J. High Perform. Comput. Appl..
[34] Greg Burns,et al. LAM: An Open Cluster Environment for MPI , 2002 .
[35] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[36] Viet D. Tran,et al. Application Recovery in Parallel Programming Environment , 2002, PVM/MPI.
[37] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[38] Ian T. Foster,et al. Condor-G: A Computation Management Agent for Multi-Institutional Grids , 2004, Cluster Computing.