Paper Information - A Reliability Analysis for Successful Execution of Parallel DAG Tasks

A Reliability Analysis for Successful Execution of Parallel DAG Tasks

Large-scale parallel computing systems are becoming increasingly failure-prone as the number of computational nodes grows, which creates serious reliability problems for parallel computing. To ensure that parallel tasks such as Meta tasks and DAG tasks run successfully, reliability analysis should be performed before the tasks are scheduled. For Meta tasks, we discuss the key factors that impede the successful execution of a single task and then present a reliability formula for Meta tasks. For DAG tasks, hardware failures, software failures, network link failures, and subtask execution order are all taken into account: we calculate not only the reliability of the subtasks but also the reliability of network communication, and we design two reliability algorithms for DAG tasks. Finally, experiments are conducted. The experimental results show that our reliability analysis methods are more effective and comprehensive.
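As a rough illustration of what combining subtask reliability with communication reliability can look like, the sketch below uses a common textbook model, not the formula or algorithms from the paper. It assumes that every component fails independently at a constant rate (so the probability of surviving for time t is exp(-rate * t)) and that the DAG task succeeds only if every subtask and every communication link survives. The function names, the dictionary layout, and the sample rates are all hypothetical, and the sketch ignores subtask execution order, which the paper does take into account.

```python
import math


def component_reliability(failure_rate, duration):
    """Probability of no failure during `duration`, assuming a constant failure rate."""
    return math.exp(-failure_rate * duration)


def dag_reliability(subtasks, edges):
    """Series-system estimate: the task succeeds only if everything survives.

    subtasks: {name: (failure_rate, execution_time)}
    edges:    {(src, dst): (link_failure_rate, transfer_time)}
    Both layouts are assumptions made for this illustration.
    """
    reliability = 1.0
    # Each subtask must finish without a node failure.
    for rate, exec_time in subtasks.values():
        reliability *= component_reliability(rate, exec_time)
    # Each data transfer between dependent subtasks must also succeed.
    for rate, transfer_time in edges.values():
        reliability *= component_reliability(rate, transfer_time)
    return reliability


# Hypothetical three-subtask DAG: a -> b, a -> c.
subtasks = {"a": (0.001, 10.0), "b": (0.002, 5.0), "c": (0.001, 5.0)}
edges = {("a", "b"): (0.0005, 2.0), ("a", "c"): (0.0005, 2.0)}
print(dag_reliability(subtasks, edges))
```

Under these assumptions the result is a simple product of exponentials, so adding nodes or links can only lower the estimate, which matches the abstract's point that larger systems are more failure-prone.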

Guosun Zeng | Wei Wang | Wen-Juan Liu | Ke-Kun Hu

[1] Wei Wang, et al. Upper Limit Analysis of Scalable Parallel Computing on the Premise of Reliability Requirement, 2016.

[2] Yuan-Shun Dai, et al. Reliability of grid service systems, 2006, Comput. Ind. Eng.

[3] Jack Dongarra, et al. Cloud Service Reliability: Modeling and Analysis, 2010.

[4] Hong He, et al. A novel discrete particle swarm optimization algorithm for meta-task assignment in heterogeneous computing systems, 2011, Microprocess. Microsystems.

[5] Hong-Zhong Huang, et al. Grid Service Reliability Modeling and Optimal Task Scheduling Considering Fault Recovery, 2011, IEEE Transactions on Reliability.

[6] Yun Zhou, et al. The Reliability Wall for Exascale Supercomputing, 2012, IEEE Transactions on Computers.

[7] Yiping Yang, et al. Scheduling of fork-join tasks on multi-core processors to avoid communication conflict, 2015, TENCON 2015 - 2015 IEEE Region 10 Conference.

[8] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[9] Zbigniew J. Czech, et al. Introduction to Parallel Computing, 2017.

[10] Xie Guo. DAG Reliability Model and Fault-Tolerant Algorithm for Heterogeneous Distributed Systems, 2013.

[11] W. D. van Driel, et al. Software reliability and its interaction with hardware reliability, 2014, 2014 15th International Conference on Thermal, Mechanical and Multi-Physics Simulation and Experiments in Microelectronics and Microsystems (EuroSimE).

[12] Ian T. Foster, et al. Making a case for distributed file systems at Exascale, 2011, LSAP '11.

[13] Teresa Gomes, et al. An effective algorithm for computing all-terminal reliability bounds, 2015, Networks.

[14] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.

[15] S. Thirumurugan, et al. Analysis of Testing and Operational Software Reliability in SRGM based on NHPP, 2007.

[16] Hai Jin, et al. Reliability Analysis for Grid Computing, 2004, GCC.

[17] Sagar Dhakal, et al. Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation, 2010, IEEE Transactions on Parallel and Distributed Systems.