Architecting Dependable Systems with Proactive Fault Management

Managing the ever-growing complexity of computing systems is a perennial challenge for computer system engineers. We argue that predictive technologies are needed to harness this complexity and to turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of methods triggered by failure predictions. We present a model to assess the effects of proactive fault management on system reliability and show that overall dependability can be enhanced significantly. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system's architecture.
