BASE: Using abstraction to improve fault tolerance

Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication technique, BASE, which uses abstraction to reduce the cost of Byzantine fault tolerance and to improve its ability to mask software errors. BASE reduces cost because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or nondeterministic service implementations, which reduces the probability of common mode failures. We built an NFS service where each replica can run a different off-the-shelf file system implementation, and an object-oriented database where the replicas ran the same, nondeterministic implementation. These examples suggest that our technique can be used in practice---in both cases, the implementation required only a modest amount of new code, and our performance results indicate that the replicated services perform comparably to the implementations that they reuse.

[1]  Gustavo Alonso,et al.  Don't Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication , 2000, VLDB.

[2]  Eric C. Cooper Replicated distributed programs , 1985, SOSP '85.

[3]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[4]  Miguel Castro,et al.  Practical byzantine fault tolerance and proactive recovery , 2002, TOCS.

[5]  Leslie Lamport,et al.  Reaching Agreement in the Presence of Faults , 1980, JACM.

[6]  Brent Callaghan,et al.  NFS Illustrated , 1999 .

[7]  Rodrigo Seromenho Miragaia Rodrigues,et al.  Combining abstraction with Byzantine fault-tolerance , 2001 .

[8]  Ciprian Tutu,et al.  Practical Wide-Area Database Replication 1 , 2002 .

[9]  C. M. Sperberg-McQueen,et al.  eXtensible Markup Language (XML) 1.0 (Second Edition) , 2000 .

[10]  Michael Williams,et al.  Replication in the harp file system , 1991, SOSP '91.

[11]  William I. Nowicki,et al.  NFS: Network File System Protocol specification , 1989, RFC.

[12]  André Schiper,et al.  Lightweight causal and atomic group multicast , 1991, TOCS.

[13]  Barbara Liskov,et al.  Program Development in Java - Abstraction, Specification, and Object-Oriented Design , 1986 .

[14]  Kyle Geiger,et al.  Inside ODBC , 1995 .

[15]  Yennun Huang,et al.  Software rejuvenation: analysis, module and applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[16]  Alexander Romanovsky Faulty version recovery in object-oriented N-version programming , 2000, IEE Proc. Softw..

[17]  David L. Mills,et al.  Network Time Protocol (Version 3) Specification, Implementation , 1992 .

[18]  Keith Marzullo,et al.  Supplying high availability with a standard network file system , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[19]  David J. DeWitt,et al.  The 007 Benchmark , 1993, SIGMOD '93.

[20]  Maurice Herlihy,et al.  Axioms for concurrent objects , 1987, POPL '87.

[21]  Andreas Reuter,et al.  Transaction Processing: Concepts and Techniques , 1992 .

[22]  John K. Ousterhout,et al.  Why Aren't Operating Systems Getting Faster As Fast as Hardware? , 1990, USENIX Summer.

[23]  Daniel P. Siewiorek,et al.  High-availability computer systems , 1991, Computer.

[24]  Priya Narasimhan,et al.  Consistent Object Replication in the external System , 1998, Theory Pract. Object Syst..

[25]  Priya Narasimhan,et al.  Providing support for survivable CORBA applications with the Immune system , 1999, Proceedings. 19th IEEE International Conference on Distributed Computing Systems (Cat. No.99CB37003).

[26]  Robert Gruber,et al.  Efficient optimistic concurrency control using loosely synchronized clocks , 1995, SIGMOD '95.

[27]  Mark Garland Hayden,et al.  The Ensemble System , 1998 .

[28]  Garth A. Gibson,et al.  RAID: high-performance, reliable secondary storage , 1994, CSUR.

[29]  Gustavo Alonso,et al.  Improving the scalability of fault-tolerant database clusters , 2002, Proceedings 22nd International Conference on Distributed Computing Systems.

[30]  Jean Arlat,et al.  MetaKernels and fault containment wrappers , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[31]  Mahadev Satyanarayanan,et al.  Scale and performance in a distributed file system , 1987, SOSP '87.

[32]  Liming Chen,et al.  N-VERSION PROGRAMMINC: A FAULT-TOLERANCE APPROACH TO RELlABlLlTY OF SOFTWARE OPERATlON , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[33]  Miguel Castro,et al.  Proactive recovery in a Byzantine-fault-tolerant system , 2000, OSDI.

[34]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[35]  Miguel Castro,et al.  HAC: hybrid adaptive caching for distributed storage systems , 1997, SOSP.

[36]  Fred B. Schneider,et al.  Hypervisor-based fault tolerance , 1996, TOCS.

[37]  Silvano Maffeis,et al.  Adding Group Communication and Fault-Tolerance to CORBA , 1995, COOTS.

[38]  Miguel Castro,et al.  Providing Persistent Objects in Distributed Systems , 1999, ECOOP.

[39]  Michael Stonebraker,et al.  The Implementation of Postgres , 1990, IEEE Trans. Knowl. Data Eng..

[40]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[41]  David J. DeWitt,et al.  The oo7 Benchmark , 1993, SIGMOD Conference.

[42]  David L. Mills,et al.  Network Time Protocol (Version 3) Specification, Implementation and Analysis , 1992, RFC.

[43]  B. Ramkumar,et al.  Portable checkpointing for heterogeneous architectures , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[44]  Sanjay Ghemawat,et al.  The Modified Object Buffer: A Storage Management Technique for Object-Oriented Databases , 1995 .