Why software hangs and what can be done with it

Software hangs are an annoying failure mode and a major threat to the dependability of many software systems. To avoid hangs at the design phase or fix them in production runs, it is desirable to understand their characteristics. Unfortunately, to our knowledge, there is currently no comprehensive study of why software hangs and how to deal with it. In this paper, we study the reported hang-related bugs of four typical open-source software applications, aiming to gain insight into the characteristics of software hangs and to provide guidelines for fixing them in the first place or remedying them in production runs.
