Somersault Software Fault-Tolerance

Keywords: software fault-tolerance, process replication, failure masking, continuous availability, topology

The ambition of fault-tolerant systems is to provide application-transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and its properties, including:

1. Fault-tolerance: Somersault implements "process mirroring" within a group of processes called a recovery unit. Failure of individual group members is completely masked.
2. Abstraction: Somersault provides lossless messaging between units. Recovery units and single processes are addressed uniformly as single entities. Recovery unit application code is unaware of replication.
3. High performance: The simple protocol provides throughput comparable to non-fault-tolerant processes at a low latency overhead. There is also sub-second failover time.
4. Compositionality: The same protocol is used to communicate between recovery units as between single processes, so any topology can be formed.
5. Scalability: Failure detection, failure recovery and general system performance are independent of the number of recovery units in a software system.

Somersault has been developed at HP Laboratories. At the time of writing it is undergoing industrial trials.
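The fault-tolerance and abstraction properties can be illustrated with a toy, in-memory sketch. It is not Somersault's API or protocol, which the abstract does not specify: the names Counter, SingleProcess, RecoveryUnit, deliver and fail are all hypothetical, and the sketch assumes a deterministic application and ignores networking, failure detection and recovery. It only shows the idea that a mirrored group and a single process can be addressed the same way, and that the crash of one group member is invisible to the sender.

```python
class Counter:
    """Hypothetical deterministic application: keeps a running sum."""

    def __init__(self):
        self.total = 0

    def handle(self, msg):
        self.total += msg
        return self.total


class SingleProcess:
    """An unreplicated endpoint."""

    def __init__(self, app_factory):
        self.app = app_factory()

    def deliver(self, msg):
        return self.app.handle(msg)


class RecoveryUnit:
    """A mirrored group, addressed exactly like a single process (assumed design)."""

    def __init__(self, app_factory, replicas=2):
        self.members = [app_factory() for _ in range(replicas)]
        self.alive = [True] * replicas

    def fail(self, index):
        # Simulate the crash of one group member.
        self.alive[index] = False

    def deliver(self, msg):
        # Every live member handles the message, so their states stay identical.
        replies = [m.handle(msg) for m, up in zip(self.members, self.alive) if up]
        if not replies:
            raise RuntimeError("all members of the recovery unit have failed")
        # The unit gives one external answer, taken from the first live member.
        return replies[0]


# The sender uses the same call for both kinds of endpoint.
single = SingleProcess(Counter)
unit = RecoveryUnit(Counter)

single_replies = [single.deliver(m) for m in (1, 2, 3, 4)]

unit_replies = [unit.deliver(1), unit.deliver(2)]
unit.fail(0)  # one member crashes mid-stream
unit_replies += [unit.deliver(3), unit.deliver(4)]

print(single_replies == unit_replies)
```

Because the application code in Counter never refers to replicas, the same class runs unchanged in both endpoints, which mirrors the claim that recovery unit application code is unaware of replication.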
