A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

Over the past several years, heterogeneous computing (HC) systems have become more competitive as commercial computing platforms than homogeneous systems. As HC systems grow in scale, network failures become inevitable. To achieve high performance, communication reliability should therefore be considered when designing reliability-aware task scheduling algorithms. In this paper, we propose a new algorithm called RMSR (Replication-based scheduling for Maximizing System Reliability), which incorporates task communication into system reliability. To maximize communication reliability, we propose an improved algorithm that searches for all optimal-reliability communication paths for the current tasks. During the task replication phase, the task reliability threshold is set by the user and the number of replicas of each task is determined dynamically. Our comparative studies on both randomly generated graphs and application graphs of real-world problems show that the RMSR algorithm outperforms existing scheduling algorithms in terms of system reliability. For randomly generated graphs, we analyze several factors that affect performance. For an application graph of a real-world problem with a fixed DAG, the system reliability of the RMSR algorithm is influenced by at most one factor.
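The two ideas named in the abstract, choosing a communication path of maximum reliability and adding replicas until a user-defined reliability threshold is met, can be illustrated with a minimal sketch. This is not the RMSR algorithm itself. It assumes that link failures are independent, so a path's reliability is the product of its link reliabilities, and that a task succeeds if at least one replica succeeds. The function names `most_reliable_path` and `replicas_for_threshold`, the graph layout, and the greedy replica order are hypothetical choices made for illustration. The sketch also returns a single best path, whereas the paper describes searching for all optimal paths.

```python
import heapq
import math


def most_reliable_path(links, src, dst):
    """Return (path, reliability) for one maximum-reliability path.

    links maps a node to a list of (neighbor, link_reliability) pairs,
    with each reliability assumed to lie in (0, 1].
    """
    # Maximizing a product of reliabilities is the same as minimizing
    # the sum of -log(reliability), so Dijkstra's algorithm applies.
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, r in links.get(node, []):
            new_cost = cost - math.log(r)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None, 0.0
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-best[dst])


def replicas_for_threshold(candidate_reliabilities, threshold):
    """Greedily pick replicas until 1 - prod(1 - r_i) >= threshold.

    Returns (chosen_reliabilities, achieved_reliability). If the threshold
    cannot be reached, every candidate is used.
    """
    chosen = []
    failure = 1.0  # probability that all chosen replicas fail
    for r in sorted(candidate_reliabilities, reverse=True):
        if 1.0 - failure >= threshold:
            break
        chosen.append(r)
        failure *= 1.0 - r
    return chosen, 1.0 - failure


# Hypothetical three-processor network: the two-hop route beats the direct link.
links = {"p1": [("p2", 0.90), ("p3", 0.99)], "p3": [("p2", 0.98)]}
path, path_reliability = most_reliable_path(links, "p1", "p2")

# Hypothetical per-processor reliabilities for one task, threshold 0.99.
chosen, task_reliability = replicas_for_threshold([0.90, 0.80, 0.95], 0.99)
```

In this example the direct link has reliability 0.90, while the route through `p3` has 0.99 × 0.98 = 0.9702, so the two-hop path is selected. For the replication step, two replicas (0.95 and 0.90) already give 1 − 0.05 × 0.10 = 0.995, so a third replica is not needed; a task with less reliable candidates would receive more.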
