Algorithm-based diskless checkpointing for fault tolerant matrix operations
[1] W. Kent Fuchs,et al. Lazy checkpoint coordination for bounding rollback propagation , 1992, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems.
[2] David Gelernter,et al. Supercomputing out of recycled garbage: preliminary experience with Piranha , 1992, ICS '92.
[3] W. Kent Fuchs,et al. Optimistic message logging for independent checkpointing in message-passing systems , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.
[4] David Cummings,et al. Checkpoint/rollback in a distributed system using coarse-grained dataflow , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[5] Kai Li,et al. Faster checkpointing with N+1 parity , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[6] Randy H. Katz,et al. Failure correction techniques for large disk arrays , 1989, ASPLOS III.
[7] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.
[8] Jack Dongarra,et al. Heterogeneous network computing , 1991 .
[9] Flaviu Cristian,et al. A timestamp-based checkpointing protocol for long-lived distributed computations , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.
[10] Walter A. Burkhard,et al. Disk array storage system reliability , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.
[11] Ten-Hwang Lai,et al. On Distributed Snapshots , 1987, Inf. Process. Lett..
[12] Yuval Tamir,et al. Error Recovery in Multicomputers Using Global Checkpoints , 1984 .
[13] M. Moura Silva,et al. Checkpointing SPMD applications on transputer networks , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.
[14] Willy Zwaenepoel,et al. On the use and implementation of message logging , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[15] Willy Zwaenepoel,et al. Measured Performance of Consistent Checkpointing , 1992 .
[16] E. N. Elnozahy,et al. Replicated distributed processes in Manetho , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.
[17] Henri E. Bal,et al. Transparent fault-tolerance in parallel Orca programs , 1992 .
[18] Nitin H. Vaidya. Consistent Logical Checkpointing , 1994 .
[19] Richard Koo,et al. Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.
[20] Kai Li,et al. ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[21] David B. Johnson and Willy Zwaenepoel. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing , 1990 .
[22] David B. Johnson,et al. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing , 1988, J. Algorithms.
[23] Anita Borg,et al. A message system supporting fault tolerance , 1983, SOSP '83.
[24] M. Blaum,et al. EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures , 1994, Proceedings of 21 International Symposium on Computer Architecture.
[25] Wolfgang Graetsch,et al. Fault tolerance under UNIX , 1989, TOCS.
[26] Miroslaw Malek,et al. Space/Time Overhead Analysis and Experiments with Techniques for Fault Tolerance , 1993 .
[27] Jack J. Dongarra,et al. Solving linear systems on vector and shared memory computers , 1990 .
[28] Jian Xu,et al. Adaptive independent checkpointing for reducing rollback propagation , 1993, Proceedings of 1993 5th IEEE Symposium on Parallel and Distributed Processing.
[29] Jack Dongarra,et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.
[30] James R. Russell,et al. Optimistic failure recovery for very large networks , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.
[31] Richard Barrett,et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.
[32] Amber Roy-Chowdhury,et al. Algorithm-based fault location and recovery for matrix computations , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[33] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[34] Jack Dongarra,et al. PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing , 1995 .
[35] Jack Dongarra,et al. Pvm: A Users' Guide and Tutorial for Network Parallel Computing , 1994 .
[36] Kenneth P. Birman,et al. Exploiting replication in distributed systems , 1990 .
[37] Peter Steenkiste,et al. Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .
[38] Erik Seligman,et al. Dome: Distributed Object Migration Environment , 1994 .
[39] Phil Kearns,et al. Rollback based on vector time , 1993, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems.
[40] Franklin T. Luk,et al. An Analysis of Algorithm-Based Fault Tolerance Techniques , 1988, J. Parallel Distributed Comput..
[41] Richard D. Schlichting,et al. Supporting Fault-Tolerant Parallel Programming in Linda , 1995, IEEE Trans. Parallel Distributed Syst..
[42] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[43] Kai Li,et al. Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.
[44] Kai Li,et al. A Failure Correction Technique for Parallel Storage Devices with Minimal Device Overhead , 1994 .
[45] Robert E. Strom,et al. Optimistic recovery in distributed systems , 1985, TOCS.
[46] Jeffrey F. Naughton,et al. Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994 .