IGOR: Accelerating Byzantine Fault Tolerance for Real-Time Systems with Eager Execution

Critical real-time systems such as spacecraft and aircraft commonly use Byzantine fault-tolerant (BFT) state machine replication (SMR) to mask faulty processors and sensors. Unfortunately, existing BFT SMR techniques require replicas to reach agreement on redundant sensor data and perform source selection before executing, which adds unavoidable latency to every execution and decreases control performance. The standard way to reduce the latency of BFT SMR in non-real-time systems is to use speculation: replicas forgo agreement on inputs altogether and repeat executions when faults occur. However, this approach is not suitable for real-time systems, since its worst-case latency when faults occur can be even higher than that of non-speculative systems. This paper presents IGOR, a new speculative BFT SMR approach that leverages multi-core processors to avoid the added latency inherent to traditional BFT SMR techniques in both the absence and presence of faults. The key idea of IGOR is to eagerly execute on data from redundant sensors in parallel. While these executions are underway, the replicas reach agreement on which execution will determine the system's final state. As a result, IGOR's latency is reduced to the time taken by the executions or by the agreement process, whichever is longer. Our evaluation shows that IGOR reduces latency by up to $1.75\times$ and improves schedulability by $1.88\times$ to $3.22\times$ compared to the state of the art. We also show that when used to execute spacecraft guidance, navigation, and control software during a dynamic mission phase, IGOR noticeably increases vehicle stability.
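
The eager-execution idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: `control_step`, `agree_on_source`, and `eager_step` are hypothetical names, the agreement step is replaced by a local median-selection stand-in rather than a real BFT protocol, and Python threads only illustrate the structure (they do not provide true multi-core parallelism for CPU-bound work). The two latency helpers restate the abstract's claim that the eager approach costs the longer of the two phases instead of their sum.

```python
from concurrent.futures import ThreadPoolExecutor


def control_step(reading: float) -> float:
    """Hypothetical control law standing in for the replicated task."""
    return 2.0 * reading + 1.0


def agree_on_source(readings: list[float]) -> int:
    """Stand-in for agreement: return the index of the median reading.

    Assumption: a real system would run a BFT agreement protocol
    across replicas here; this local function only mimics its output.
    """
    order = sorted(range(len(readings)), key=lambda i: readings[i])
    return order[len(order) // 2]


def eager_step(readings: list[float]) -> tuple[int, float]:
    """Execute on every sensor input while agreement runs concurrently."""
    with ThreadPoolExecutor() as pool:
        # Start one execution per redundant sensor, without waiting
        # for the replicas to agree on which input to use.
        executions = [pool.submit(control_step, r) for r in readings]
        # Agreement proceeds at the same time as the executions.
        agreement = pool.submit(agree_on_source, readings)
        chosen = agreement.result()
        # Keep only the execution selected by agreement.
        return chosen, executions[chosen].result()


def latency_sequential(t_agree: int, t_exec: int) -> int:
    """Agree first, then execute: the two phases add up."""
    return t_agree + t_exec


def latency_eager(t_agree: int, t_exec: int) -> int:
    """Overlap the phases: the longer one dominates."""
    return max(t_agree, t_exec)


if __name__ == "__main__":
    print(eager_step([10.0, 10.5, 99.0]))
    print(latency_sequential(3, 4), latency_eager(3, 4))
```

In this sketch the outlier reading (99.0) is still executed on, but its result is discarded once the stand-in agreement selects the median source; the cost is extra computation on spare cores rather than extra latency.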
