Keywords: software fault-tolerance, process replication, failure masking, continuous availability, topology

The ambition of fault-tolerant systems is to provide application-transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and its properties, including:

1. Fault-tolerance: Somersault implements "process mirroring" within a group of processes called a recovery unit. Failure of individual group members is completely masked.
2. Abstraction: Somersault provides lossless messaging between units. Recovery units and single processes are addressed uniformly as single entities. Recovery unit application code is unaware of replication.
3. High performance: The simple protocol provides throughput comparable to non-fault-tolerant processes at a low latency overhead. There is also sub-second failover time.
4. Compositionality: The same protocol is used to communicate between recovery units as between single processes, so any topology can be formed.
5. Scalability: Failure detection, failure recovery and general system performance are independent of the number of recovery units in a software system.

Somersault has been developed at HP Laboratories. At the time of writing it is undergoing industrial trials.