Towards the Design of Efficient Error Detection Mechanisms for Transient Data Errors

The pervasive nature of modern computer systems has increased our reliance on them to provide correct and timely services. Moreover, as the functionality of computer systems is increasingly defined in software, it is imperative that software be dependable. It has previously been shown that a fault-intolerant software system can be made fault tolerant through the design and deployment of software mechanisms implementing abstract artefacts known as error detection mechanisms (EDMs) and error recovery mechanisms (ERMs); the design of these components is therefore central to the design of dependable software systems. The EDM design problem, which relates to the construction of a boolean predicate over a set of program variables, is inherently difficult, and current approaches rely on system specifications and the experience of software engineers. As this process necessarily entails identifying program variables and incorporating them into an error detection predicate, this thesis addresses the EDM design problem from a novel variable-centric perspective. The research presented supports the thesis that, where it exists under the assumed system model, an efficient EDM consists of a set of critical variables. In particular, this research proposes (i) a metric suite that can be used to generate a relative ranking of the program variables in a software system with respect to their criticality, (ii) a systematic approach for the generation of highly efficient error detection predicates for EDMs, and (iii) an approach for dependability enhancement based on the protection of critical variables using software wrappers that implement error detection and correction predicates that are known to be efficient.
This research substantiates the thesis that an efficient EDM contains a set of critical variables on the basis that (i) the proposed metric suite can identify critical variables through the application of an appropriate threshold, (ii) efficient EDMs can be constructed using only the critical variables identified by the metric suite, and (iii) the criticality of the identified variables can be shown to extend across a software module, such that an efficient EDM designed for that module should seek to determine the correctness of the identified variables.
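To make the notion of an error detection predicate over program variables concrete, the following is a minimal illustrative sketch, not the method developed in the thesis. It assumes a hypothetical module whose state is a dictionary of variables, a hypothetical critical variable named `speed` with an assumed valid range, and simple range checks standing in for the systematically generated predicates. The names `make_detector`, `make_corrector`, and `wrap` are invented for illustration.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def make_detector(bounds: Bounds) -> Callable[[State], bool]:
    """Build a boolean predicate over the named (assumed critical) variables.

    The predicate returns True when at least one variable lies outside
    its assumed valid range, i.e. when an error is detected.
    """
    def detector(state: State) -> bool:
        return any(not (lo <= state[name] <= hi)
                   for name, (lo, hi) in bounds.items())
    return detector


def make_corrector(bounds: Bounds) -> Callable[[State], State]:
    """Build a simple correction step that clamps each variable into range."""
    def corrector(state: State) -> State:
        fixed = dict(state)  # leave the caller's state untouched
        for name, (lo, hi) in bounds.items():
            fixed[name] = min(max(fixed[name], lo), hi)
        return fixed
    return corrector


def wrap(module: Callable[[State], float],
         detector: Callable[[State], bool],
         corrector: Callable[[State], State]) -> Callable[[State], float]:
    """Wrap a module so its input state is checked, and corrected if needed."""
    def wrapped(state: State) -> float:
        if detector(state):
            state = corrector(state)
        return module(state)
    return wrapped


# Hypothetical module and critical variable, chosen only for illustration.
def control_step(state: State) -> float:
    return state["speed"] * 2.0


BOUNDS: Bounds = {"speed": (0.0, 100.0)}
detector = make_detector(BOUNDS)
protected_step = wrap(control_step, detector, make_corrector(BOUNDS))
```

In this sketch the wrapper inspects only the variables named in `BOUNDS`, which mirrors the variable-centric view described above: the detector's cost and coverage depend on which variables it checks, so restricting it to critical variables is what would keep it efficient.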
