Scheduling for fault tolerance and real time in multicomputer systems

This dissertation addresses scheduling for fault-tolerant and real-time computing in parallel and distributed environments. The crash fault model is the primary fault model considered, and time redundancy is emphasized. Two approaches are presented, one for periodic tasks and one for aperiodic tasks. For periodic tasks, the schedulability of a task set scheduled with Rate Monotonic Scheduling (RMS) is analyzed for a system susceptible to a single fault. The task priorities obtained from RMS analysis are maintained even during recovery. Under these conditions, it is guaranteed that no task will miss a deadline, even in the presence of a fault, if the utilization factor of the processor does not exceed 0.5. Thus 0.5 is the minimum achievable utilization that permits recovery from faults before the deadlines of the tasks expire. This bound is larger than the naive bound of 0.345, half of the traditional RMS schedulability bound of 0.69, that would be obtained if computation times were doubled in the RMS analysis to allow for re-execution. The result provides scheduling guarantees for tolerating a variety of intermittent and transient hardware and software faults that can be handled simply by re-execution. In addition, it is demonstrated how permanent faults can be tolerated efficiently by maintaining common spares among a set of processors that independently execute periodic tasks. For a system with aperiodic tasks, a dynamic scheduler is designed that uses consensus, the operation of reaching agreement among working processors on some aspect of system state. Scheduling is performed locally in a distributed system using the Earliest Deadline First policy, and consensus is reached on the schedule to facilitate recovery and sharing of tasks from processors that have failed. The various consensus operations in the system are consolidated, and consensus itself is scheduled only at regular intervals.
Simulation is used to analyze the behavior of the system in the presence of faults, and the results indicate a substantial reduction in the number of lost tasks. An implementation of the scheduler is used in CORE (COnsensus based REsponsive system) on a network of distributed workstations, and its successful operation validates the design decisions.
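
The gap between the 0.5 bound and the naive bound for periodic tasks can be illustrated with a short utilization check. The following is a minimal sketch, not the analysis from the dissertation: the function names and the example task set are hypothetical, the value 0.5 is taken directly from the abstract and assumes its conditions (a single fault, RMS priorities maintained during recovery), and the naive test uses ln 2 (about 0.693), the large-n limit of the classic RMS bound that the abstract rounds to 0.69.

```python
import math


def utilization(tasks):
    """Total processor utilization of (computation_time, period) pairs."""
    return sum(c / t for c, t in tasks)


def liu_layland_bound(n):
    """Classic sufficient RMS utilization bound for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)


def fault_tolerant_rms_ok(tasks, bound=0.5):
    """Sufficient test using the 0.5 bound stated in the abstract.

    Assumption: at most one fault, handled by re-execution, with RMS
    priorities kept during recovery.
    """
    return utilization(tasks) <= bound


def naive_rms_ok(tasks):
    """Naive test: double every computation time, then apply the
    limiting RMS bound ln 2. Equivalent to requiring the original
    utilization to stay below roughly 0.3466.
    """
    doubled = [(2 * c, t) for c, t in tasks]
    return utilization(doubled) <= math.log(2)


# Hypothetical task set: (computation_time, period), utilization 0.45.
tasks = [(1, 5), (2, 10), (1, 20)]

print("utilization:", utilization(tasks))
print("accepted under the 0.5 bound:", fault_tolerant_rms_ok(tasks))
print("accepted under the naive doubled test:", naive_rms_ok(tasks))
```

With this task set the 0.5 test accepts a workload that the naive doubled-computation test rejects, which is the practical meaning of the larger bound. Both checks are sufficient tests only; a task set that fails them is not necessarily unschedulable.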