A novel fault-tolerant scheduling algorithm for precedence constrained tasks in real-time heterogeneous systems

Fault-tolerance is an essential requirement for real-time systems, due to the potentially catastrophic consequences of faults. In this paper, we investigate an efficient off-line scheduling algorithm that generates schedules in which real-time tasks with precedence constraints can tolerate one processor's permanent failure in a heterogeneous system with a fully connected network. The tasks are assumed to be non-preemptable, and each task has two copies that are scheduled on different processors and are mutually exclusive in time. Previous work improved the quality of a schedule by allowing a backup copy to overlap with other backup copies on the same processor. However, that approach assumes that tasks are independent of one another. To meet the needs of real-time systems where tasks have precedence constraints, a new overlapping scheme is proposed. We show that, given two tasks, the necessary conditions for their backup copies to safely overlap in time with each other are (1) their corresponding primary copies are scheduled on two different processors, (2) they are independent tasks, and (3) the execution of their backup copies implies the failures of the processors on which their primary copies are scheduled. For tasks with precedence constraints, the new overlapping scheme allows the backup copy of a task to overlap with its successors' primary copies, thereby further reducing schedule length. Based on a proposed reliability model, tasks are judiciously allocated to processors so as to maximize the reliability of heterogeneous systems. Additionally, the times for detecting and handling a permanent fault are incorporated into the scheduling scheme. We have performed experiments using synthetic workloads as well as a real-world application. Simulation results show that, compared with existing scheduling algorithms in the literature, our scheduling algorithm improves reliability by up to 22.4% (with an average of 16.4%) and improves performability, a measure that combines reliability and schedulability, by up to 421.9% (with an average of 49.3%).
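
The three overlap conditions stated in the abstract can be illustrated with a small check. This is only a minimal sketch and not the paper's algorithm: the `Task` record, the `backup_implies_failure` flag, and the function names are hypothetical, and "independent" is assumed here to mean that neither task is a direct or transitive predecessor of the other. Under the single-failure assumption, two backups that each run only when a different processor has failed can never both execute, which is why sharing a time slot is plausible when all three conditions hold.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All fields are illustrative assumptions, not the paper's data model.
    name: str
    primary_proc: int             # processor hosting the primary copy
    backup_implies_failure: bool  # backup runs only if that processor failed
    preds: set = field(default_factory=set)  # names of direct predecessors


def ancestors(name, tasks):
    """Return all direct and transitive predecessors of a task."""
    seen, stack = set(), list(tasks[name].preds)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(tasks[p].preds)
    return seen


def meets_overlap_conditions(a, b, tasks):
    """Check the three necessary conditions for backups of a and b to overlap."""
    # (1) primary copies are placed on two different processors
    different_primaries = tasks[a].primary_proc != tasks[b].primary_proc
    # (2) the tasks are independent (assumed: no precedence path either way)
    independent = a not in ancestors(b, tasks) and b not in ancestors(a, tasks)
    # (3) executing either backup implies its primary's processor has failed
    both_imply_failure = (tasks[a].backup_implies_failure
                          and tasks[b].backup_implies_failure)
    return different_primaries and independent and both_imply_failure


# Hypothetical three-task example: t3 depends on t1, t2 is unrelated.
tasks = {
    "t1": Task("t1", primary_proc=0, backup_implies_failure=True),
    "t2": Task("t2", primary_proc=1, backup_implies_failure=True),
    "t3": Task("t3", primary_proc=2, backup_implies_failure=True, preds={"t1"}),
}
print(meets_overlap_conditions("t1", "t2", tasks))
print(meets_overlap_conditions("t1", "t3", tasks))
```

Because the abstract describes these as necessary conditions, a scheduler would presumably still have to verify timing and deadline constraints before placing the two backups in the same slot.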
