On-the-fly healing of race conditions in ARINC-653 flight software

The ARINC-653 standard architecture for flight software specifies an application executive (APEX) that provides an application programming interface and defines a hierarchical health-management framework for error detection and recovery. Within every partition of the architecture, however, asynchronously concurrent processes or threads may contain concurrency bugs such as unintended race conditions, which are common and difficult to remove by testing. A race condition on shared data, or data race, is a pair of unsynchronized instructions that access the same shared variable, at least one of which is a write. Data races seriously and latently threaten the reliability of shared-memory programs because they cause unintended nondeterministic executions. To heal data races during the execution of ARINC-653 flight software, this paper instruments the target program with on-the-fly race detection and incorporates on-the-fly race healing into the health management of the ARINC-653 architecture. When a data race is detected, the race detector signals the health monitor through the corresponding APEX call. The health monitor then responds by invoking an aperiodic, user-defined error-handling process that is assigned the highest possible priority. This process uses an APEX call to identify the race condition as an application error, one of the seven error types defined by ARINC-653, and then heals it. The race-healing process assures at run time that any execution result of the healed program could also have been produced by the original program, so no new functional bug is introduced. Finally, the paper evaluates the efficiency of these on-the-fly mechanisms to argue that they are practical to configure within ARINC-653 partitions.
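
The detect, signal, and heal pipeline described above can be sketched as a small Python simulation. This is a hedged illustration under stated assumptions, not the paper's implementation: `HealthMonitor`, `RaceDetector`, `raise_application_error`, `get_error_status`, and the `DATA_RACE` message format are hypothetical names loosely modeled on the ARINC-653 APEX services RAISE_APPLICATION_ERROR and GET_ERROR_STATUS. The detector assumes vector clocks to decide whether two accesses are unsynchronized, the error handler runs synchronously instead of as a separate aperiodic process, and the "heal" step merely marks the variable for serialized access.

```python
from __future__ import annotations

from dataclasses import dataclass

APPLICATION_ERROR = "APPLICATION_ERROR"  # stand-in for the ARINC-653 error code


def happens_before(a: tuple[int, ...], b: tuple[int, ...]) -> bool:
    """Vector-clock order: a -> b iff a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: tuple[int, ...], b: tuple[int, ...]) -> bool:
    """Two events are unsynchronized if neither happens before the other."""
    return not happens_before(a, b) and not happens_before(b, a)


@dataclass(frozen=True)
class Access:
    pid: int                # accessing process (thread) in the partition
    var: str                # shared variable name
    is_write: bool
    clock: tuple[int, ...]  # vector clock of the accessing process


class HealthMonitor:
    """Hypothetical simulation of the partition-level health monitor."""

    def __init__(self) -> None:
        self.pending: list[tuple[str, str]] = []
        self.serialized: set[str] = set()  # variables the handler has healed

    def raise_application_error(self, message: str) -> None:
        # Loosely mimics RAISE_APPLICATION_ERROR: queue the error, then run the
        # handler at once, as a highest-priority process would preempt others.
        self.pending.append((APPLICATION_ERROR, message))
        self.error_handler()

    def get_error_status(self) -> tuple[str, str]:
        # Loosely mimics GET_ERROR_STATUS: return the oldest pending error.
        return self.pending.pop(0)

    def error_handler(self) -> None:
        code, message = self.get_error_status()
        if code == APPLICATION_ERROR and message.startswith("DATA_RACE "):
            # Hypothetical heal: mark the variable so later accesses to it are
            # serialized (for example under a lock).
            self.serialized.add(message.split()[1])


class RaceDetector:
    """Naive on-the-fly detector that keeps every access (unbounded history)."""

    def __init__(self, monitor: HealthMonitor) -> None:
        self.monitor = monitor
        self.history: dict[str, list[Access]] = {}

    def on_access(self, acc: Access) -> bool:
        racy = False
        for prev in self.history.get(acc.var, []):
            conflicting = prev.is_write or acc.is_write  # at least one write
            if prev.pid != acc.pid and conflicting and concurrent(prev.clock, acc.clock):
                racy = True
        self.history.setdefault(acc.var, []).append(acc)
        if racy:
            self.monitor.raise_application_error(f"DATA_RACE {acc.var}")
        return racy


if __name__ == "__main__":
    monitor = HealthMonitor()
    detector = RaceDetector(monitor)
    # P0 writes x and P1 reads x with no synchronization between them.
    detector.on_access(Access(0, "x", True, (1, 0)))
    detector.on_access(Access(1, "x", False, (0, 1)))  # race reported
    # P1 reads y only after synchronizing with P0's write, so no race.
    detector.on_access(Access(0, "y", True, (2, 0)))
    detector.on_access(Access(1, "y", False, (2, 2)))
    print(monitor.serialized)  # {'x'}
```

Keeping every access is the simplest correct choice for a sketch; practical on-the-fly detectors bound the per-variable access history to keep time and memory overhead low. Serializing the remaining accesses to a racy variable only selects one of the interleavings the unhealed program could already have taken, which matches the intuition behind the claim that healing introduces no new functional bug.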