Assessment and Improvement of Hang Detection in the Linux Operating System

We propose a fault injection framework to assess the hang detection facilities of the Linux Operating System (OS). The framework is novel in two respects: it adopts a more representative faultload than existing frameworks, and it is more effective in terms of the number of hang failures it produces. The representativeness of the faultload is supported by a field data study on the Linux OS. Using the proposed framework together with realistic workloads, we find that the Linux OS fails to detect hangs in several cases, and we measure a relative coverage of 75%. To improve the detection facilities, we propose a simple yet effective hang detector that periodically tests OS liveness, as perceived by applications, by means of I/O system calls. We show that this approach can improve relative coverage up to 94%. The hang detector can be deployed on any Linux system with an acceptable overhead.
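The idea of testing liveness through periodic I/O system calls can be illustrated with a minimal user-space sketch. This is not the authors' detector: the function names (io_probe, probe_with_timeout, monitor), the scratch-file approach, and the period and timeout parameters are all assumptions made for illustration. Each probe issues a few I/O system calls on a scratch file, and a probe that does not complete within the timeout is counted as a suspected hang.

```python
import os
import tempfile
import threading
import time


def io_probe(path):
    """Issue open/write/fsync/lseek/read/unlink system calls on a scratch file.

    Hypothetical probe: returns True if the data read back matches the data written.
    """
    payload = b"liveness-probe"
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)                      # force the write through to the device
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload)) == payload
    finally:
        os.close(fd)
        os.unlink(path)


def probe_with_timeout(probe, timeout_s):
    """Run probe() in a daemon thread; True only if it succeeds within timeout_s."""
    result = []

    def run():
        try:
            result.append(bool(probe()))
        except OSError:
            result.append(False)          # an I/O error also counts as a failed probe

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)                # returns early if the probe finishes in time
    return bool(result) and result[0]


def monitor(probe, period_s, timeout_s, rounds, on_hang=print):
    """Probe every period_s seconds for a fixed number of rounds.

    Calls on_hang(round_index) for each failed or timed-out probe and
    returns the number of such probes.
    """
    missed = 0
    for i in range(rounds):
        if not probe_with_timeout(probe, timeout_s):
            missed += 1
            on_hang(i)
        time.sleep(period_s)
    return missed


if __name__ == "__main__":
    scratch = os.path.join(tempfile.gettempdir(), f"hang_probe_{os.getpid()}")
    misses = monitor(lambda: io_probe(scratch), period_s=1.0, timeout_s=5.0, rounds=3)
    print("suspected hangs:", misses)
```

A monitor of this kind can only report a hang while it is itself still being scheduled, so a practical deployment would need some way to raise the alarm from outside the monitored process. The period and timeout trade detection latency against probing overhead.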
