DAG Reliability Model and Fault-Tolerant Algorithm for Heterogeneous Distributed Systems

The performance of heterogeneous distributed systems has improved significantly, but at the cost of a dramatic increase in failures. Fault-tolerant scheduling in heterogeneous distributed systems under the DAG (Directed Acyclic Graph) task model has therefore become a research focus. Widely used fault-tolerant algorithms based on task replication have the following problems: (1) the constraint between the reliability requirement of each DAG task and the reliability requirement of the DAG as a whole is handled with deficiencies and lacks a rigorous proof; (2) each task has only one backup copy, which is not enough to cope with potential repeated failures; (3) e faults of each task are tolerated blindly with e+1 backup copies, which improves system reliability but causes high redundancy and resource consumption. First, the task dependencies of the DAG are analyzed, the DAG task reliability probability model is determined, and on this basis the DAG reliability model is constructed. Second, three algorithms are presented to meet the reliability target of the DAG and to quantify precisely the number of replicas of each task: an algorithm for the lower limit of task duplication, an economic task duplication strategy algorithm, and a greedy task replication strategy algorithm. Finally, the OPDFT (Optional Policy on DAG Fault-Tolerant) algorithm is proposed based on these three algorithms. Experiments show that the reliability cost of the economic policy and the greedy policy of the OPDFT algorithm is about 60% and 70%, respectively, of that of the blind strategy.
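
The abstract does not state the reliability formulas or the exact replica-selection rule, so the following Python sketch only illustrates the general idea of choosing per-task replica counts against a DAG reliability target. It rests on assumptions that are not taken from the paper: replicas fail independently, a task succeeds if at least one of its copies succeeds, and the DAG succeeds only if every task succeeds, so DAG reliability is the product of the task reliabilities. The function names (`task_reliability`, `dag_reliability`, `greedy_replication`) and the gain rule are hypothetical and are not the OPDFT algorithm itself.

```python
from math import prod


def task_reliability(r, copies):
    """Probability that at least one of `copies` independent copies succeeds
    (assumed model: each copy succeeds with probability r)."""
    return 1.0 - (1.0 - r) ** copies


def dag_reliability(rels, copies):
    """Assumed DAG model: every task must succeed, so reliabilities multiply."""
    return prod(task_reliability(r, k) for r, k in zip(rels, copies))


def greedy_replication(rels, target, max_copies=10):
    """Add one copy at a time to the task whose extra copy raises DAG
    reliability the most, until the target is met (illustrative rule only)."""
    copies = [1] * len(rels)
    while dag_reliability(rels, copies) < target:
        best, best_gain = None, 0.0
        for i, r in enumerate(rels):
            if copies[i] >= max_copies:
                continue
            # Multiplicative improvement of the DAG reliability from one more copy.
            gain = task_reliability(r, copies[i] + 1) / task_reliability(r, copies[i])
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            raise ValueError("target reliability unreachable within max_copies")
        copies[best] += 1
    return copies


# Hypothetical per-copy reliabilities for three tasks and a DAG target of 0.99.
rels = [0.90, 0.95, 0.99]
target = 0.99
copies = greedy_replication(rels, target)
blind = [3] * len(rels)  # blind strategy with e = 2, i.e. e + 1 copies per task
print(copies, sum(copies), sum(blind))
```

Under these assumptions, the targeted allocation reaches the reliability goal with fewer total copies than giving every task e+1 copies, which is the kind of saving the abstract attributes to the economic and greedy policies.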