RRFT: A Rank-Based Resource Aware Fault Tolerant Strategy for Cloud Platforms

Applications deployed in the cloud to serve users comprise a large number of interconnected, interdependent components. To handle unexpected failures and provide uninterrupted service to the end user, multiple identical components are scheduled to run concurrently, which introduces a resource overhead problem for the cloud service provider. Furthermore, such resource-intensive fault-tolerant strategies impose extra monetary overhead on the cloud service provider and, eventually, on cloud users. To address these issues, a novel fault-tolerant strategy based on the significance level of each component is developed. The application components are ranked using the communication topology among them, their historical performance, failure rate, failure impact on other components, and mutual dependencies, and the rank determines the importance of one component relative to the others. Based on the rank, a Markov Decision Process (MDP) model is presented to determine the number of replicas, which varies from one component to another. A rigorous performance evaluation compares the proposed strategy with similar component-ranking and fault-tolerant strategies using common, practically useful metrics such as recovery time upon a fault, average number of components needed, and number of parallel components successfully executed. Simulation results demonstrate that the proposed algorithm reduces the required number of virtual and physical machines by approximately 10% and 4.2%, respectively, compared to other similar algorithms.
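
The idea of ranking components and then giving higher-ranked components more replicas can be illustrated with a minimal sketch. This is not the RRFT algorithm: the abstract does not give the ranking formula or the MDP formulation, so the sketch assumes a PageRank-style score over a weighted invocation graph, scaled by failure rate, and replaces the MDP with a simple tiered rule. The names `rank_components`, `replicas_by_rank`, `damping`, and `max_replicas` are hypothetical.

```python
def rank_components(invocations, failure_rate, damping=0.85, iters=100):
    """Assumed ranking: PageRank-style score scaled by failure rate.

    invocations[a][b] is how often component a invokes component b.
    failure_rate[v] is the observed failure rate of component v.
    Returns a dict of significance values that sum to 1.
    """
    nodes = sorted(
        set(invocations)
        | {b for outs in invocations.values() for b in outs}
        | set(failure_rate)
    )
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for a in nodes:
            outs = invocations.get(a, {})
            total = sum(outs.values())
            if total == 0:
                # Component invokes nothing: spread its score evenly.
                for v in nodes:
                    nxt[v] += damping * score[a] / n
            else:
                # Pass score to invoked components, weighted by frequency.
                for b, w in outs.items():
                    nxt[b] += damping * score[a] * w / total
        score = nxt
    # Assumption: components that fail more often matter more.
    weighted = {v: score[v] * (1.0 + failure_rate.get(v, 0.0)) for v in nodes}
    norm = sum(weighted.values())
    return {v: weighted[v] / norm for v in nodes}


def replicas_by_rank(significance, max_replicas=3):
    """Stand-in for the MDP: split the ranking into equal tiers,
    giving the top tier max_replicas and the bottom tier one replica."""
    ordered = sorted(significance, key=significance.get, reverse=True)
    n = len(ordered)
    return {v: max_replicas - (i * max_replicas) // n
            for i, v in enumerate(ordered)}


if __name__ == "__main__":
    calls = {"web": {"auth": 3, "db": 1}, "auth": {"db": 2}, "db": {}}
    rates = {"web": 0.01, "auth": 0.02, "db": 0.05}
    significance = rank_components(calls, rates)
    print(replicas_by_rank(significance))
```

In this toy graph, the heavily invoked `db` component ranks highest and receives the most replicas, while `web`, which nothing else depends on, receives a single copy. A real implementation would replace the tiered rule with a policy that weighs replica cost against failure impact, as the MDP in the paper is described as doing.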
