We Crashed, Now What?

We present an in-depth analysis of the crash-recovery problem and propose a novel approach to recover from otherwise fatal operating system (OS) crashes. We show how an unconventional, but careful, OS design, aided by automatic compiler-based code instrumentation, offers a practical solution towards the survivability of the entire system. Current results are encouraging and show that our approach is able to recover even the most critical OS subsystems without exposing the failure to user applications or hampering the scalability of the system.

[1]  Michael Norrish,et al.  seL4: formal verification of an OS kernel , 2009, SOSP '09.

[2]  Dinakar Dhurjati,et al.  SAFECode: enforcing alias analysis for weakly typed languages , 2005, PLDI '06.

[3]  Christopher Strachey,et al.  A theory of programming language semantics , 1976 .

[4]  Wouter Joosen,et al.  PAriCheck: an efficient pointer arithmetic checker for C programs , 2010, ASIACCS '10.

[5]  van der Arjan Schaft,et al.  Systems and Networks , 1993 .

[6]  Brian N. Bershad,et al.  Recovering device drivers , 2004, TOCS.

[7]  Michael Stumm,et al.  Otherworld: giving applications a chance to survive OS kernel crashes , 2010, EuroSys '10.

[8]  Junfeng Yang,et al.  An empirical study of operating systems errors , 2001, SOSP.

[9]  Andrea C. Arpaci-Dusseau,et al.  Membrane: Operating system support for restartable file systems , 2010, TOS.

[10]  Herbert Bos,et al.  Failure Resilience for Device Drivers , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[11]  David A. Patterson,et al.  A Simple Way to Estimate the Cost of Downtime , 2002, LISA.

[12]  George C. Necula,et al.  SafeDrive: safe and recoverable extensions using language-based techniques , 2006, OSDI '06.

[13]  Andrew S. Tanenbaum,et al.  Operating systems: design and implementation , 1987, Prentice-Hall software series.

[14]  Roy H. Campbell,et al.  CuriOS: Improving Reliability through Operating System Structure , 2008, OSDI.

[15]  Dinakar Dhurjati,et al.  Enforcing Alias Analysis for Weakly Typed Languages , 2005 .

[16]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[17]  Elaine J. Weyuker,et al.  The distribution of faults in a large industrial software system , 2002, ISSTA '02.

[18]  Samuel T. King,et al.  Recovery domains: an organizing principle for recoverable operating systems , 2009, ASPLOS.