An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications running on them. One way to account for failures is to employ a reliability-aware scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. To address this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on heterogeneous clusters with arbitrary network architectures. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, REFTD duplicates a task if its hazard rate exceeds a threshold $\theta$. A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
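The abstract does not give the model's equations, so the following is only a minimal sketch of how a Weibull-based hazard test might look, using the standard two-parameter Weibull forms $h(t) = (\beta/\eta)(t/\eta)^{\beta-1}$ and $R(t) = e^{-(t/\eta)^{\beta}}$. The function names, the parameters `shape` ($\beta$) and `scale` ($\eta$), and the choice to evaluate the hazard rate at a single time point are illustrative assumptions, not the paper's REFTD definition.

```python
import math


def weibull_hazard(t, shape, scale):
    """Standard Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape-1).

    Assumes t > 0; for shape < 1 the hazard is unbounded at t = 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def task_reliability(start, duration, shape, scale):
    """Probability that a resource alive at `start` survives until
    start + duration, i.e. R(start + duration) / R(start)."""
    end = start + duration
    return math.exp((start / scale) ** shape - (end / scale) ** shape)


def should_duplicate(t, shape, scale, theta):
    """Hypothetical duplication test: duplicate the task when the hazard
    rate at time t exceeds the threshold theta (assumed evaluation point)."""
    return weibull_hazard(t, shape, scale) > theta


if __name__ == "__main__":
    # Illustrative parameters only; not taken from the paper.
    shape, scale, theta = 2.0, 10.0, 0.05
    print(weibull_hazard(5.0, shape, scale))
    print(task_reliability(0.0, 5.0, shape, scale))
    print(should_duplicate(5.0, shape, scale, theta))
```

With `shape = 1` the hazard rate is constant and the model reduces to the exponential distribution; a shape above 1 models wear-out, where the hazard grows over time and tasks scheduled later become more likely to cross the threshold.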
