What Is System Hang and How to Handle It

Almost every computer user has encountered an unresponsive system failure, or system hang, which leaves the user no choice but to power off the computer. In this paper, the causes of such failures are analyzed in detail and an empirical hypothesis for detecting system hang is proposed. This hypothesis exploits a small set of system performance metrics provided by the OS itself, thereby avoiding both modifications to the OS kernel and additional cost (e.g., hardware modules). Based on this hypothesis, we propose SHFH, a self-healing framework for handling system hang that can be deployed dynamically on a running OS. One distinctive feature of SHFH is its "light-heavy" detection strategy, which is designed to trade off the performance overhead of system hang detection against its false positive rate. Another is its diagnosis-based recovery strategy, which offers finer granularity when recovering from system hang. Our experimental results show that SHFH covers 95.34% of system hang scenarios, with a false positive rate of 0.58% and a performance overhead of 0.6%, validating the effectiveness of our empirical hypothesis.
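
To make the "light-heavy" idea concrete, the following is a minimal sketch of a two-stage check, not SHFH's actual implementation. The abstract does not name the metrics or thresholds used, so the metric names (`run_queue_len`, `context_switches_per_s`), the thresholds, and the sample count below are all hypothetical, chosen only to illustrate how a cheap first stage can gate a costlier confirmation stage.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]

# Hypothetical thresholds, chosen only for illustration (not from the paper).
LIGHT_RUN_QUEUE_LIMIT = 32.0
HEAVY_MIN_CONTEXT_SWITCHES = 1.0
HEAVY_SAMPLES = 3


def light_check(sample: Metrics) -> bool:
    """Cheap first stage: flag a suspicious sample using a single metric."""
    return sample["run_queue_len"] > LIGHT_RUN_QUEUE_LIMIT


def heavy_check(read_metrics: Callable[[], Metrics]) -> bool:
    """Costlier second stage: confirm only if every follow-up sample looks stalled."""
    for _ in range(HEAVY_SAMPLES):
        sample = read_metrics()
        if sample["context_switches_per_s"] >= HEAVY_MIN_CONTEXT_SWITCHES:
            return False  # the system is still making progress
    return True


def detect_hang(read_metrics: Callable[[], Metrics]) -> bool:
    """Run the heavy stage only when the light stage raises suspicion."""
    if not light_check(read_metrics()):
        return False
    return heavy_check(read_metrics)
```

In this sketch the light stage runs on every polling cycle and keeps the common-case overhead low, while the heavy stage runs only on suspicion and reduces false positives by requiring several consecutive stalled samples. A busy but healthy system (long run queue, non-zero context switches) passes the light stage but is rejected by the heavy stage.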
