Log-Based Failure Analysis of Complex Systems: Methodology and Relevant Applications

Failure analysis is valuable to dependability engineers because it supports designing effective mitigation mechanisms, defining strategies to reduce maintenance costs, and improving system service. Event logs, which contain textual information about regular and anomalous events detected by the system under real workload conditions, are a key source of data for failure analysis and have been used successfully in a variety of domains. This chapter describes the methodology and the well-established techniques underlying log-based failure analysis. The description introduces the workflow that leads from the raw data in the log to the analysis results. The chapter also surveys relevant works in the area to highlight the main objectives and applications of log-based failure analysis. The discussion reveals the benefits and limitations of logs for evaluating complex systems.
