A Process Health Status Service for Safety Related Systems Using TT/ET Communication Scheduling

This paper describes a health status protocol for distributed real-time systems that use TTCAN, FlexRay, or other networks that support both time-triggered and event-triggered communication. The protocol allows a group of co-operating processes to establish a consistent view of each other's health status over time. It thereby extends the instantaneous view of each process's operational status that a process group membership protocol provides. The health status and membership protocols are intended for systems in which processes (not nodes) are the smallest unit of failure, and in which the host node can detect process failures and recover from them locally. Such systems require a decision function that determines whether a process failure is temporary (the process is being recovered by the host node) or permanent (local recovery is not possible or was unsuccessful). Our protocol ensures that correct nodes make these decisions consistently despite symmetric and asymmetric omission failures.
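To illustrate the kind of decision function the abstract refers to, the following is a minimal sketch and not the protocol defined in the paper. It assumes that every correct node receives the same agreed membership view in each round (as a membership protocol would provide) and that a process missing from more than a fixed number of consecutive views is declared permanently failed. The names `HealthTracker`, `Health`, and `max_missed_rounds`, and the threshold rule itself, are hypothetical choices made for this example.

```python
from enum import Enum


class Health(Enum):
    OPERATIONAL = "operational"
    RECOVERING = "recovering"  # temporary failure; local recovery assumed in progress
    FAILED = "failed"          # permanent failure


class HealthTracker:
    """Hypothetical threshold-based decision function (illustration only)."""

    def __init__(self, processes, max_missed_rounds=3):
        # Assumed parameter: number of consecutive missed rounds tolerated
        # before a failure is classified as permanent.
        self.max_missed_rounds = max_missed_rounds
        self.missed = {p: 0 for p in processes}
        self.status = {p: Health.OPERATIONAL for p in processes}

    def update(self, members):
        """Apply one agreed membership view (the set of operational processes)."""
        for p in self.missed:
            if self.status[p] is Health.FAILED:
                continue  # a permanent verdict is never revised in this sketch
            if p in members:
                self.missed[p] = 0
                self.status[p] = Health.OPERATIONAL
            else:
                self.missed[p] += 1
                if self.missed[p] > self.max_missed_rounds:
                    self.status[p] = Health.FAILED
                else:
                    self.status[p] = Health.RECOVERING
        return dict(self.status)


if __name__ == "__main__":
    tracker = HealthTracker(["p1", "p2"], max_missed_rounds=2)
    for view in [{"p1", "p2"}, {"p1"}, {"p1"}, {"p1"}]:
        print(tracker.update(view))
```

Because the verdict in this sketch depends only on the sequence of membership views, nodes that observe identical views reach identical verdicts. How such agreement is maintained under omission failures is the subject of the paper and is not modelled here.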
