Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems

In existing studies on fault-tolerant scheduling, the active replication scheme uses e + 1 replicas of each task to tolerate e failures. In this paper, however, we show that more replicas do not always lead to higher reliability. Moreover, more replicas imply more resource consumption and a higher economic cost. To address this problem, this paper proposes a new fault-tolerant scheduling algorithm, MaxRe, which aims to satisfy the user's reliability requirement with minimum resources. The algorithm incorporates reliability analysis into the active replication scheme and uses a dynamic number of replicas for different tasks. Both theoretical analysis and experiments show that the schedule produced by MaxRe satisfies the user's reliability requirement, and that MaxRe achieves the corresponding reliability with up to 70% fewer resources than the FTSA algorithm.
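The idea of a per-task, dynamic replica count can be illustrated with a minimal sketch. This is not the MaxRe algorithm itself, whose details the abstract does not give. The function name `replicas_for_target`, the per-processor success probabilities, and the assumption that replicas fail independently are all hypothetical choices made for illustration. Under those assumptions, a task succeeds if at least one replica succeeds, so replicas are added on the most reliable processors until 1 - Π(1 - r_i) reaches the target.

```python
def replicas_for_target(reliabilities, target):
    """Pick processors for one task until the reliability target is met.

    reliabilities: hypothetical probability that a replica of the task
                   succeeds on each processor (assumed independent).
    target:        required probability that at least one replica succeeds.
    Returns the chosen processor indices, or None if the target cannot be
    reached even when every processor is used.
    """
    # Consider the most reliable processors first.
    order = sorted(range(len(reliabilities)),
                   key=lambda i: reliabilities[i], reverse=True)
    chosen = []
    fail_all = 1.0  # probability that every chosen replica fails
    for i in order:
        if 1.0 - fail_all >= target:
            break
        chosen.append(i)
        fail_all *= 1.0 - reliabilities[i]
    if 1.0 - fail_all < target:
        return None
    return chosen


if __name__ == "__main__":
    procs = [0.9, 0.99, 0.8]
    # A loose target needs a single replica; a stricter one needs two.
    print(replicas_for_target(procs, 0.95))
    print(replicas_for_target(procs, 0.995))
```

In this simplified model the replica count varies with the target and the processors available, rather than being fixed at e + 1. The model ignores scheduling, communication, and any other effects covered by the paper's reliability analysis, so it does not reproduce the paper's observation that extra replicas can fail to improve reliability.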
