FlashBox: a system for logging non-deterministic events in deployed embedded systems

Postmortem analysis of failures caused by non-deterministic events in deployed systems is valuable in crash investigations. With this goal in mind, we propose FlashBox, a system that acts as a black box for embedded systems by recording non-deterministic events (interrupts). The FlashBox hardware consists of a microcontroller and flash memory. The FlashBox software is a compiler extension that enables recording at various granularities. FlashBox requires no source code modifications and makes no assumptions about processor capabilities such as hardware counters. The FlashBox log can be used for faithful replay, with the goal of isolating faults and reasoning about failures. We present a prototype implementation of FlashBox that logs non-deterministic events on an AVR ATmega169 microcontroller. The prototype hardware consists of an 8051 microcontroller with flash memory, and the avr-gcc compiler has been extended to log non-deterministic events. In our experiments, FlashBox incurs 10-23% overhead while logging non-deterministic events at instruction-level granularity. With the decreasing cost of flash memory, FlashBox provides a low-cost logging mechanism. Its use of standard I/O communication protocols enhances portability and eases integration with different classes of embedded systems.
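
To make the idea of recording interrupts without hardware counters concrete, the following is a minimal sketch in C and not the FlashBox implementation. It assumes that compiler-inserted calls maintain a software counter and that each interrupt is logged together with the counter value at which it arrived. All names (`fb_tick`, `fb_record_irq`, `fb_replay_due`, `fb_event_t`) are hypothetical, and an in-memory array stands in for the external logging microcontroller and flash memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record: which interrupt fired, and the value of a
   software counter at that moment (no hardware counter is assumed). */
typedef struct {
    uint8_t  irq_id;
    uint32_t sw_count;
} fb_event_t;

#define FB_LOG_CAPACITY 64u

/* Assumption: this array stands in for the external flash-backed log. */
static fb_event_t fb_log[FB_LOG_CAPACITY];
static size_t     fb_log_len  = 0;
static uint32_t   fb_sw_count = 0;

/* Call an instrumenting compiler might insert at each recording point;
   how often it is inserted determines the logging granularity. */
void fb_tick(void)
{
    fb_sw_count++;
}

/* Call an instrumenting compiler might insert at interrupt handler entry.
   Returns 0 on success, -1 if the log is full. */
int fb_record_irq(uint8_t irq_id)
{
    if (fb_log_len >= FB_LOG_CAPACITY) {
        return -1;
    }
    fb_log[fb_log_len].irq_id   = irq_id;
    fb_log[fb_log_len].sw_count = fb_sw_count;
    fb_log_len++;
    return 0;
}

/* Replay helper: if the next unreplayed event (index `next`) was recorded
   at counter value `count`, return its interrupt id; otherwise return -1. */
int fb_replay_due(size_t next, uint32_t count)
{
    assert(next <= fb_log_len); /* the replay cursor never passes the log end */
    if (next < fb_log_len && fb_log[next].sw_count == count) {
        return (int)fb_log[next].irq_id;
    }
    return -1;
}
```

During replay, a driver would step the same counter and call `fb_replay_due` at each recording point, re-injecting the interrupt whenever a non-negative id is returned. Under this assumed scheme, placing `fb_tick` calls more densely gives finer granularity at the cost of more run-time overhead.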
