Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

As heterogeneous computing systems grow in scale and complexity, failures occur frequently and hinder the execution of large-scale applications. Fault-tolerant scheduling is therefore essential for large-scale computing systems. Existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures actually affect the execution of those tasks. Such active replication strategies not only waste resources but also lengthen the makespan. Moreover, they cannot guarantee that an application executes successfully. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm, named FTDR, that overcomes these drawbacks. FTDR continuously listens for processor failures and reschedules the suspended tasks as soon as a failure occurs. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. We evaluate the algorithm on randomly generated DAGs. Experimental results show that FTDR achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
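The failure-triggered rescheduling idea described above can be illustrated with a minimal sketch. This is not the FTDR algorithm itself, whose details the abstract does not give. It assumes a greedy earliest-finish-time rule, ignores DAG precedence constraints and communication costs, and uses hypothetical names (`reschedule_on_failure`, `assignment`, `cost`, `ready_at`) chosen only for illustration.

```python
from typing import Dict, Set


def reschedule_on_failure(
    assignment: Dict[str, str],          # task -> processor it was scheduled on
    finished: Set[str],                  # tasks that already completed
    failed: str,                         # processor that just failed
    cost: Dict[str, Dict[str, float]],   # cost[task][processor] = execution time
    ready_at: Dict[str, float],          # processor -> time it becomes free
) -> Dict[str, str]:
    """Move unfinished tasks off the failed processor (illustrative only).

    Assumption: each suspended task goes to the surviving processor that
    gives it the earliest finish time. Precedence and communication delays
    are deliberately left out of this sketch.
    """
    alive = [p for p in ready_at if p != failed]
    if not alive:
        raise RuntimeError("no surviving processor to reschedule onto")

    new_assignment = dict(assignment)
    free = {p: ready_at[p] for p in alive}

    # Tasks are "suspended" if they sat on the failed processor and
    # had not finished before the failure.
    suspended = [
        t for t, p in assignment.items() if p == failed and t not in finished
    ]

    for task in suspended:
        best = min(alive, key=lambda p: free[p] + cost[task][p])
        free[best] += cost[task][best]   # the chosen processor is busy longer
        new_assignment[task] = best
    return new_assignment


# Example: processor p2 fails while holding the unfinished tasks t2 and t3.
example = reschedule_on_failure(
    assignment={"t1": "p1", "t2": "p2", "t3": "p2", "t4": "p3"},
    finished={"t1"},
    failed="p2",
    cost={"t2": {"p1": 4.0, "p3": 2.0}, "t3": {"p1": 3.0, "p3": 5.0}},
    ready_at={"p1": 1.0, "p2": 0.0, "p3": 2.0},
)
print(example)
```

Because the function only touches tasks that were actually suspended, it can be called again on each later failure as long as at least one processor survives. No replicas are created in advance, which is the contrast with static active replication that the abstract draws.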
