Investigation of failure causes in workload-driven reliability testing

Virtual execution environments and middleware must be extremely reliable because applications running on top of them are developed under the assumption of their correctness, and platform-level failures can result in serious and unexpected application-level problems. Since software platforms and middleware often execute for long periods without interruption, a large part of the testing process is devoted to investigating their behavior under long and stressful executions (such test cases are called workloads). When a problem is identified, software engineers examine log files to find its root cause. Unfortunately, because of the length of workloads, log files can contain a huge amount of information, and manual analysis is often prohibitive. Thus, in practice, the identification of the root cause is mostly left to the intuition of the software engineer. In this paper, we propose a technique to automatically analyze logs obtained from workloads and retrieve information that can relate a failure to its cause. The technique works in three steps: (1) during workload executions, the system under test is monitored; (2) logs extracted from workloads that completed successfully are used to derive compact and general models of the expected behavior of the target system; (3) logs corresponding to workloads that terminated unsuccessfully are compared with the inferred models to identify anomalous event sequences. These anomalies help software engineers identify failure causes. The technique can also be used during the operational phase to discover possible causes of unexpected failures, by comparing logs of failing executions with models derived at testing time. Preliminary experimental results on the Java Virtual Machine indicate that several bugs can be rapidly identified thanks to the feedback provided by our technique.
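The three-step approach can be made concrete with a small illustration. The sketch below is not the authors' implementation: it assumes that monitoring (step 1) has already abstracted each log line into a discrete event, and it substitutes a deliberately simple model (the set of event bigrams observed in passing runs) for the compact, general models the paper actually infers. All event names, traces, and function names are hypothetical.

```python
# Minimal sketch of the three-step idea, under two simplifying assumptions
# that are NOT from the paper: (1) each log line has already been abstracted
# into one event, and (2) the "model" is just the set of consecutive event
# pairs (bigrams) ever seen in a successful workload.

from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]

def infer_model(passing_logs: Iterable[List[str]]) -> Set[Bigram]:
    """Step 2: derive a model of expected behavior from the logs of
    successfully completed workloads (here: every event transition
    observed in at least one passing execution)."""
    model: Set[Bigram] = set()
    for events in passing_logs:
        model.update(zip(events, events[1:]))
    return model

def find_anomalies(model: Set[Bigram], failing_log: List[str]) -> List[Tuple[int, Bigram]]:
    """Step 3: compare a failing workload's log against the inferred
    model and report positions of transitions never seen in any
    passing execution."""
    return [(i, pair)
            for i, pair in enumerate(zip(failing_log, failing_log[1:]))
            if pair not in model]

if __name__ == "__main__":
    # Hypothetical event traces standing in for monitored JVM logs (step 1).
    passing = [
        ["start", "load_class", "jit_compile", "run", "gc", "run", "stop"],
        ["start", "load_class", "run", "gc", "run", "stop"],
    ]
    failing = ["start", "load_class", "run", "gc", "gc", "deadlock", "stop"]

    model = infer_model(passing)
    for index, (a, b) in find_anomalies(model, failing):
        print(f"anomalous transition at position {index}: {a} -> {b}")
```

On these example traces, the transitions gc -> gc, gc -> deadlock, and deadlock -> stop are flagged because they never occur in a passing log; in the same spirit, anomalous event sequences in real workload logs would point engineers toward the part of a long execution where the failure's root cause lies.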
