Fault Tolerance and the Five-Second Rule

We propose a new approach to fault tolerance that we call bounded-time recovery (BTR). BTR is intended for systems that need strong timeliness guarantees during normal operation but can tolerate short outages in an emergency, e.g., when they are under attack. We argue that BTR could be a good fit for many cyber-physical systems. We also sketch a technical approach to providing BTR and discuss the challenges that remain.
