Maximizing Reliability of Data-Intensive Workflow Systems with Active Fault Tolerance Schemes in Cloud

Most existing research on cloud workflow systems has focused on resource scheduling, aiming either to minimize system delay under budget constraints or to minimize system cost under deadline constraints. However, cloud providers cannot guarantee a failure-free environment, and a compact scheduling plan is prone to failure. Workflow system reliability has therefore been identified as a critical and challenging issue in the volatile cloud environment. The capabilities of the cloud make it easy for users to implement active fault-tolerance schemes such as Scale-Out. However, Scale-Out introduces issues such as security problems and extra management cost. In this paper, we first investigate the Scale-Up and Scale-Hybrid schemes to fully explore the possibilities offered by the cloud. We then formally model the problem of maximizing the reliability of a cloud workflow system under budget constraints for each of the three fault-tolerance schemes. Because these optimization problems are discrete and non-convex, we propose a genetic-algorithm-based method for workflow fault tolerance (GA4WFT). Finally, we evaluate the effectiveness and efficiency of the proposed GA4WFT with the three fault-tolerance schemes through experiments conducted on Amazon EC2 data.
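
To make the shape of such a problem concrete, the following is a minimal sketch of a budget-constrained reliability search using a generic genetic algorithm. It is not the GA4WFT method itself. It assumes a purely serial workflow, independent replica failures, and a Scale-Out-style scheme in which each task runs on a chosen number of identical replicas; all function names, parameters, and numbers are hypothetical.

```python
import random
from math import prod


def reliability(counts, rel):
    """Assumed model: the workflow succeeds only if every task succeeds,
    and a task succeeds if at least one of its replicas succeeds."""
    return prod(1 - (1 - r) ** n for r, n in zip(rel, counts))


def cost(counts, price):
    """Total cost: per-replica price times replica count, summed over tasks."""
    return sum(p * n for p, n in zip(price, counts))


def fitness(counts, rel, price, budget):
    # Plans over budget score 0, so they never beat a feasible plan.
    return reliability(counts, rel) if cost(counts, price) <= budget else 0.0


def ga_replica_allocation(rel, price, budget, max_replicas=4, pop_size=40,
                          generations=100, mutation_rate=0.1, seed=0):
    """Search for replica counts per task (needs at least two tasks)."""
    rng = random.Random(seed)
    n_tasks = len(rel)

    def score(plan):
        return fitness(plan, rel, price, budget)

    # Seed the population with the cheapest plan plus random plans.
    population = [[1] * n_tasks] + [
        [rng.randint(1, max_replicas) for _ in range(n_tasks)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = population[:2]  # elitism: keep the two best plans
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(population, 3), key=score)
            b = max(rng.sample(population, 3), key=score)
            # One-point crossover, then per-gene mutation.
            cut = rng.randint(1, n_tasks - 1)
            child = a[:cut] + b[cut:]
            child = [rng.randint(1, max_replicas)
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)


if __name__ == "__main__":
    rel = [0.90, 0.95, 0.80]   # hypothetical per-replica reliabilities
    price = [1.0, 2.0, 1.5]    # hypothetical per-replica prices
    best = ga_replica_allocation(rel, price, budget=10.0)
    print(best, reliability(best, rel), cost(best, price))
```

The decision variables are integers and the objective is a product of nonlinear terms, which is why a population-based heuristic is a natural fit here. Zero fitness for over-budget plans is the simplest way to handle the constraint; a graded penalty or a repair step would be alternatives.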
