Keywords: software fault-tolerance, process replication, failure masking, continuous availability, topology

The ambition of fault-tolerant systems is to provide application-transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and its properties, including:

1. Fault-tolerance: Somersault implements "process mirroring" within a group of processes called a recovery unit. Failure of individual group members is completely masked.
2. Abstraction: Somersault provides loss-less messaging between units. Recovery units and single processes are addressed uniformly as single entities. Recovery unit application code is unaware of replication.
3. High performance: The simple protocol provides throughput comparable to non-fault-tolerant processes at a low latency overhead. There is also sub-second failover time.
4. Compositionality: The same protocol is used to communicate between recovery units as between single processes, so any topology can be formed.
5. Scalability: Failure detection, failure recovery and general system performance are independent of the number of recovery units in a software system.

Somersault has been developed at HP Laboratories. At the time of writing it is undergoing industrial trials.
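The fault-tolerance and abstraction properties can be illustrated with a toy model. The sketch below is not Somersault's API or protocol, which this abstract does not specify; the names `Process`, `RecoveryUnit`, `Endpoint` and `send` are hypothetical, and everything runs inside one Python process with no real network or failure detection. It only shows the idea that a sender addresses a mirrored group and a single process in the same way, and that the loss of one group member is hidden from the sender.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Endpoint(Protocol):
    """Anything a sender can address: a single process or a recovery unit."""

    def deliver(self, message: str) -> str: ...


@dataclass
class Process:
    """A single, unreplicated process (hypothetical model for illustration)."""

    name: str
    alive: bool = True
    log: List[str] = field(default_factory=list)

    def deliver(self, message: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} has failed")
        self.log.append(message)
        return f"ack:{message}"


@dataclass
class RecoveryUnit:
    """A group of mirrored processes addressed as one entity (assumed model)."""

    members: List[Process]

    def deliver(self, message: str) -> str:
        # Mirror the message to every live member so each holds the same log.
        live = [m for m in self.members if m.alive]
        if not live:
            raise ConnectionError("all members of the recovery unit have failed")
        replies = [m.deliver(message) for m in live]
        # The sender sees a single reply however many members remain.
        return replies[0]


def send(target: Endpoint, message: str) -> str:
    """Senders use the same call for single processes and recovery units."""
    return target.deliver(message)
```

For example, after building `unit = RecoveryUnit([Process("primary"), Process("secondary")])`, a call to `send(unit, "m1")` looks the same to the caller as `send(Process("solo"), "m1")`, and it keeps succeeding if one member's `alive` flag is later set to `False`. A real implementation would also need failure detection, state transfer to a replacement member, and ordering guarantees, none of which this sketch attempts.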