Comprehensive and Efficient Runtime Checking in System Software through Watchdogs

Systems software today is composed of numerous modules and exhibits complex failure modes. Existing failure detectors focus on catching simple, complete failures and treat programs uniformly at the process level. In this paper, we argue that modern software needs intrinsic failure detectors that are tailored to individual systems and can detect anomalies within a process at finer granularity. In particular, we advocate the notion of intrinsic software watchdogs and propose an abstraction for them. Among the different styles of watchdogs, we believe that watchdogs that imitate the main program can provide the best combination of completeness, accuracy, and localization for detecting gray failures. However, manually constructing such mimic-type watchdogs is challenging and time-consuming. To close this gap, we present an early exploration of automatically generating mimic-type watchdogs.
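
To make the idea of a mimic-type watchdog more concrete, the following is a minimal illustrative sketch and not the abstraction proposed in the paper. It assumes a main program that persists data to a directory. The names `Watchdog`, `Report`, and `make_disk_write_checker` are hypothetical and chosen only for this example. Each checker imitates one operation of the main program inside the same process, and a deadline turns a hang or slowdown into a report that names the failing operation.

```python
import concurrent.futures
import os
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Report:
    checker: str   # name of the mimicked operation (gives localization)
    healthy: bool
    detail: str


class Watchdog:
    """Hypothetical in-process watchdog: runs mimic checkers under a deadline."""

    def __init__(self, timeout_s: float = 1.0) -> None:
        self.timeout_s = timeout_s
        self.checkers: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, fn: Callable[[], None]) -> None:
        self.checkers[name] = fn

    def run_once(self) -> List[Report]:
        reports = []
        for name, fn in self.checkers.items():
            # One worker per checker so a stuck checker cannot block the others.
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = pool.submit(fn)
            try:
                future.result(timeout=self.timeout_s)
                reports.append(Report(name, True, "ok"))
            except concurrent.futures.TimeoutError:
                # The mimicked operation hung or was too slow.
                reports.append(Report(name, False, "timed out"))
            except Exception as exc:
                # The mimicked operation failed outright.
                reports.append(Report(name, False, f"error: {exc!r}"))
            finally:
                pool.shutdown(wait=False)
        return reports


def make_disk_write_checker(data_dir: str) -> Callable[[], None]:
    """Build a checker that imitates the main program's write path (assumed)."""

    def check() -> None:
        path = os.path.join(data_dir, ".watchdog_probe")
        with open(path, "wb") as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())  # exercise the same durability step
        os.remove(path)

    return check
```

A caller would register one checker per operation worth monitoring and invoke `run_once()` periodically. Because the checker exercises the same directory and system calls as the main program would, a failure it reports points at that specific operation rather than at the process as a whole.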
