Dynamic fault tolerant scheduling with response time minimization for multiple failures in cloud

Abstract With the increasing demand for large amounts of computing resources, the cloud is widely used to execute large numbers of independent tasks. To execute more tasks successfully and maximize revenue, cloud service providers (CSPs) must provide reliable services while maximizing resource utilization. Providing better Quality of Service (QoS) while maximizing resource utilization in the event of failures is a critical research issue. In this paper, an Elastic pull-based Dynamic Fault Tolerant (E-DFT) scheduling mechanism is designed to minimize the response time while executing backups during multiple failures of independent tasks. A basic primary-backup model is also used and integrated with the backup tasks overlapping (BTO) and backup tasks fusion (BTF) techniques to tolerate multiple simultaneous failures. Simulation results show that the proposed E-DFT scheduling achieves a better guarantee ratio and resource utilization than existing scheduling algorithms.
