Crash and Authenticated Byzantine Fault Tolerance: A Fail Signaling Approach

Group communication middleware systems are particularly useful in supporting replication and thus in building dependable services. Many such systems have been implemented assuming crash failure semantics. While this assumption is not unreasonable, it becomes hard to justify when applications are required to meet high reliability requirements and are built using commercial off-the-shelf (COTS) components. This paper presents a structured approach to extend a crash-tolerant middleware system into an authenticated Byzantine-tolerant one with minor modifications to the original system. The proposed approach is based on state machine replication (SMR) and is motivated by the composability features of standard distributed object technologies such as CORBA. SMR is used to assure signal-on-failure (fail-signal) semantics at a level where existing crash-tolerant services can be seamlessly deployed. The resulting system can provide deterministic total ordering without liveness requirements at the service provisioning level. We demonstrate our claims of seamless deployment by porting a crash-tolerant CORBA group communication service. We additionally measure the performance of the resulting system and examine the trade-offs between performance and the rigor with which the fail-signal abstraction can be built.
