Reliability-aware scheduling strategy for heterogeneous distributed computing systems

Heterogeneous computing systems are promising computing platforms, since systems based on a single parallel architecture may not be sufficient to exploit the parallelism available in running applications. In some cases, heterogeneous distributed computing (HDC) systems can achieve higher performance at lower cost than single-machine supersystems. However, processors and networks in HDC systems are not failure-free, and any failure may be critical to the running applications. One way of dealing with such failures is to employ a reliable scheduling algorithm. Unfortunately, most existing scheduling algorithms for precedence-constrained tasks in HDC systems do not adequately consider the reliability requirements of interdependent tasks. In this paper, we design a reliability-driven scheduling architecture that can effectively measure system reliability, based on an optimal reliability communication path search algorithm. We then introduce the reliability priority rank (RRank), which estimates a task's priority by taking reliability overheads into account. Furthermore, we propose a reliability-aware scheduling algorithm for precedence-constrained tasks modeled as a directed acyclic graph (DAG), which can achieve high reliability for applications. Comparison studies, based on both randomly generated graphs and the graphs of some real applications, show that our scheduling algorithm outperforms existing scheduling algorithms in terms of makespan, scheduling length ratio, and reliability. Moreover, the improvement gained by our algorithm grows as the data communication among tasks increases.
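The abstract does not spell out the optimal reliability communication path search, so the following is only a minimal sketch of one common way such a search can be done, not the paper's algorithm. It assumes that each link has an independent reliability in (0, 1] and that a path's reliability is the product of its link reliabilities; under those assumptions, running Dijkstra's algorithm on the costs -log(r) finds the most reliable path. The function name `most_reliable_path`, the `links` adjacency format, and the processor names are hypothetical.

```python
import heapq
import math


def most_reliable_path(links, source, target):
    """Return (reliability, path) for the most reliable route, or (0.0, []).

    `links` maps a node to a list of (neighbor, link_reliability) pairs.
    Assumption: link failures are independent, so path reliability is the
    product of the link reliabilities along the path.
    """
    # Minimizing the sum of -log(r) is equivalent to maximizing the product of r.
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            break
        for neighbor, reliability in links.get(node, []):
            if not 0.0 < reliability <= 1.0:
                raise ValueError("link reliability must be in (0, 1]")
            new_cost = cost - math.log(reliability)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return 0.0, []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path


# Hypothetical three-processor network: the two-hop route is more reliable
# than the direct link, so it is the one returned.
links = {
    "p1": [("p2", 0.90), ("p3", 0.99)],
    "p3": [("p2", 0.98)],
    "p2": [],
}
reliability, path = most_reliable_path(links, "p1", "p2")
```

The log transform is used because Dijkstra's algorithm needs additive, non-negative edge costs, and -log(r) is non-negative for every r in (0, 1].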
