Paper information - Somersault Software Fault-Tolerance

Somersault Software Fault-Tolerance

Keywords: software fault-tolerance, process replication, failure masking, continuous availability, topology

The ambition of fault-tolerant systems is to provide application-transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and its properties, including:

1. Fault-tolerance: Somersault implements "process mirroring" within a group of processes called a recovery unit. Failure of individual group members is completely masked.
2. Abstraction: Somersault provides lossless messaging between units. Recovery units and single processes are addressed uniformly as single entities. Recovery unit application code is unaware of replication.
3. High performance: The simple protocol provides throughput comparable to non-fault-tolerant processes at a low latency overhead. There is also sub-second failover time.
4. Compositionality: The same protocol is used to communicate between recovery units as between single processes, so any topology can be formed.
5. Scalability: Failure detection, failure recovery and general system performance are independent of the number of recovery units in a software system.

Somersault has been developed at HP Laboratories. At the time of writing it is undergoing industrial trials.
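The abstraction and fault-tolerance properties listed in the abstract can be illustrated with a toy, in-process sketch. It is not Somersault's protocol or API, which the abstract does not describe: the names `CounterProcess`, `RecoveryUnit`, `ProcessFailed` and `send` are hypothetical, and real mirroring involves network messaging and failure detection that are omitted here. The sketch only shows a sender addressing a single process and a mirrored group in the same way, with the failure of one group member hidden from the sender.

```python
from dataclasses import dataclass, field


class ProcessFailed(Exception):
    """Raised when a (hypothetical) process has crashed."""


@dataclass
class CounterProcess:
    """A single, non-replicated process holding some application state."""
    total: int = 0
    alive: bool = True

    def deliver(self, amount: int) -> int:
        if not self.alive:
            raise ProcessFailed("process is down")
        self.total += amount
        return self.total


@dataclass
class RecoveryUnit:
    """Toy recovery unit: mirrored processes behind one delivery interface."""
    members: list = field(
        default_factory=lambda: [CounterProcess(), CounterProcess()]
    )

    def deliver(self, amount: int) -> int:
        replies = []
        survivors = []
        # Every live member applies the same message, so their states stay equal.
        for member in self.members:
            try:
                replies.append(member.deliver(amount))
                survivors.append(member)
            except ProcessFailed:
                pass  # the failed member is dropped; the sender never sees it
        if not survivors:
            raise ProcessFailed("all members of the recovery unit failed")
        self.members = survivors
        return replies[0]


def send(endpoint, amount: int) -> int:
    """Senders use the same call for a single process and a recovery unit."""
    return endpoint.deliver(amount)
```

For example, after `unit = RecoveryUnit()` and `send(unit, 5)`, setting `unit.members[0].alive = False` and calling `send(unit, 3)` still returns the running total, because the surviving mirror holds the same state. The sender's code is identical whether `endpoint` is a `CounterProcess` or a `RecoveryUnit`.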

Paul Vickers | Paul Murray | Roger A. Fleming | Paul D. Harry
