Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

Clouds are becoming an important platform for scientific workflow applications. However, with many nodes deployed in clouds, managing the reliability of resources becomes a critical issue, especially for real-time scientific workflows whose deadlines must be met. Fault tolerance in clouds is therefore essential. Primary-backup (PB) scheduling is a popular fault-tolerance technique that has been used effectively in cluster and grid computing. Applying it to real-time workflows in a virtualized cloud, however, is much more complicated and has rarely been studied. In this paper, we address this problem. We first establish a real-time workflow fault-tolerant model that extends the traditional PB model by incorporating cloud characteristics. Based on this model, we develop approaches for task allocation and message transmission that ensure faults can be tolerated during workflow execution. Finally, we propose a dynamic fault-tolerant scheduling algorithm, FASTER, for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it employs a backward shifting method to make full use of idle resources and incorporates task overlapping and VM migration for high resource utilization, 2) it applies a vertical/horizontal scaling-up technique to quickly provision resources for a burst of workflows, and 3) it uses a vertical scaling-down scheme to avoid unnecessary and ineffective resource changes caused by fluctuating workflow requests. We evaluate FASTER with synthetic workflows and with workflows collected from real scientific and business applications, and compare it with six baseline algorithms. The experimental results demonstrate that FASTER effectively improves resource utilization and schedulability even in the presence of node failures in virtualized clouds.
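
As a rough illustration of the traditional PB idea that the model above extends, the following minimal Python sketch places a primary copy and a passive backup copy of one task on two different hosts and accepts the task only if the backup could still finish before the deadline. It is not the FASTER algorithm: the names `Host`, `Placement`, and `place_primary_backup`, the single `free_at` availability field, and the choice to keep the backup slot reserved are all assumptions made for illustration, and the sketch omits workflow dependencies, VMs, task overlapping, and scaling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    # Hypothetical host model: a name and the earliest time it is idle.
    name: str
    free_at: float = 0.0


@dataclass
class Placement:
    primary_host: str
    primary_start: float
    primary_finish: float
    backup_host: str
    backup_start: float
    backup_finish: float


def place_primary_backup(hosts: List[Host], arrival: float,
                         exec_time: float, deadline: float) -> Optional[Placement]:
    """Place a primary and a passive backup copy on different hosts.

    Returns None if the task cannot be made fault-tolerant within its
    deadline (assumption: at most one host fails at a time).
    """
    if len(hosts) < 2:
        return None  # a backup needs a host other than the primary's

    # Primary goes to the host that can start it earliest.
    primary = min(hosts, key=lambda h: max(h.free_at, arrival))
    p_start = max(primary.free_at, arrival)
    p_finish = p_start + exec_time

    # Passive backup: starts no earlier than the primary's finish time,
    # and must run on a different host so one failure cannot hit both.
    others = [h for h in hosts if h is not primary]
    backup = min(others, key=lambda h: max(h.free_at, p_finish))
    b_start = max(backup.free_at, p_finish)
    b_finish = b_start + exec_time

    if b_finish > deadline:
        return None  # reject: the backup could not meet the deadline

    # Simplification: the backup slot stays reserved; a real PB scheduler
    # would typically release it once the primary completes successfully.
    primary.free_at = p_finish
    backup.free_at = b_finish
    return Placement(primary.name, p_start, p_finish,
                     backup.name, b_start, b_finish)
```

For example, with two hosts where the second becomes idle at time 2, a task arriving at time 0 with execution time 3 and deadline 10 gets its primary on the first host and its backup on the second, starting only after the primary's finish time. The same task with a deadline of 5 is rejected because the backup would finish too late.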
