FLARe: a Fault-tolerant Lightweight Adaptive Real-time middleware for distributed real-time and embedded systems

An important class of distributed real-time and embedded (DRE) applications consists of periodic soft real-time tasks. Timeliness and availability are essential requirements for the correct operation of these applications. Conventional approaches to meeting these requirements tend to apply non-adaptive, load-agnostic fault tolerance within a real-time system, and often end up making ad hoc fault tolerance decisions (e.g., choice of failover targets) that can further overload already strained resources. Potential adverse consequences of these ad hoc decisions include excessive delays for real-time tasks and cascading resource failures. This paper presents FLARe, a middleware that provides adaptive fault tolerance for DRE systems. FLARe's resource management infrastructure monitors system metrics, including CPU utilization, and makes informed, load-aware, and adaptive decisions about the application's fault tolerance configuration (e.g., failover targets and physical placement of replicas). FLARe also employs decision-making algorithms to adapt these decisions at runtime as faults occur, and it provides trade-offs between timeliness, availability, and performance as resources are overloaded, removed, or added.
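
To make the idea of a load-aware failover decision concrete, the following is a minimal sketch and not FLARe's actual algorithm. It assumes that each host reports a CPU utilization in the range 0 to 1, that the failed task's expected load is known, and that a host is acceptable only if its utilization stays at or below a threshold after taking over the task. The function name `pick_failover_target`, its parameters, and the default bound of 0.7 are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def pick_failover_target(
    cpu_util: Dict[str, float],
    replica_hosts: List[str],
    task_load: float,
    bound: float = 0.7,  # hypothetical utilization threshold, not from the paper
) -> Optional[str]:
    """Return the least-loaded replica host that can absorb the task.

    cpu_util      -- monitored CPU utilization per host (0.0 to 1.0)
    replica_hosts -- hosts that hold a backup replica of the failed task
    task_load     -- utilization the task is expected to add to its new host
    bound         -- maximum utilization allowed after failover
    """
    # Keep only monitored hosts that would stay within the bound after failover.
    candidates = [
        host
        for host in replica_hosts
        if host in cpu_util and cpu_util[host] + task_load <= bound
    ]
    if not candidates:
        # No replica host can take the load without being overloaded.
        return None
    # Choose the host with the lowest current utilization.
    return min(candidates, key=lambda host: cpu_util[host])


if __name__ == "__main__":
    utilization = {"node-a": 0.60, "node-b": 0.30, "node-c": 0.45}
    print(pick_failover_target(utilization, ["node-a", "node-b", "node-c"], 0.20))
```

A load-agnostic scheme would instead pick a fixed or arbitrary replica host, which in this example could be `node-a` and would push it past the assumed bound. Returning `None` when no host qualifies leaves the trade-off between timeliness and availability to the caller.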
