Assessing and improving the effectiveness of logs for the analysis of software faults

Event logs are the primary source of data for characterizing the dependability behavior of a computing system during its operational phase. However, they are inadequate for providing evidence of software faults, which are nowadays among the main causes of system outages. This paper proposes an approach based on software fault injection to assess how effectively logs keep track of software faults triggered in the field. The injection results are used to derive guidelines for improving the ability of logging mechanisms to report the effects of software faults. The benefits of the approach are shown by means of experimental results on three widely used software systems.
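As a rough illustration of how injection results might be summarized, the sketch below computes the fraction of failing injection runs that left at least one log entry. The abstract does not define its metric, so the `InjectionRun` record, its fields, and the `log_coverage` and `unlogged_faults` helpers are all hypothetical names introduced here, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InjectionRun:
    """One fault injection experiment (hypothetical record layout)."""
    fault_id: str   # assumed identifier of the injected software fault
    failed: bool    # True if the system failed during the run
    logged: bool    # True if at least one log entry reported the failure


def log_coverage(runs: List[InjectionRun]) -> Optional[float]:
    """Fraction of failing runs that were reported in the logs.

    Returns None when no run failed, since the ratio is undefined.
    """
    failures = [r for r in runs if r.failed]
    if not failures:
        return None
    return sum(r.logged for r in failures) / len(failures)


def unlogged_faults(runs: List[InjectionRun]) -> List[str]:
    """Identifiers of faults that caused a failure but left no log entry."""
    return [r.fault_id for r in runs if r.failed and not r.logged]


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    runs = [
        InjectionRun("f1", failed=True, logged=True),
        InjectionRun("f2", failed=True, logged=False),
        InjectionRun("f3", failed=False, logged=False),
    ]
    print(log_coverage(runs))
    print(unlogged_faults(runs))
```

Under these assumptions, the faults returned by `unlogged_faults` are the candidates for which additional logging statements would be worth considering.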
