Fault Tolerance via Diversity for Off-the-Shelf Products: A Study with SQL Database Servers

If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for a few highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.
