Reliability aware scheduling of bag of real time tasks in cloud environment

Cloud environments rely on data centers with a huge number of computational resources, and the probability that some resource fails increases with scale. Failures make services unavailable, which reduces the reliability of the system. Application deployment in the cloud must therefore account for resource failures. In this work, we address reliability-aware scheduling of tasks with hard deadlines in the cloud environment. We design, analyze, and provide solutions for two special cases of the problem: (a) tasks with a common deadline on machines with equal failure rates, and (b) tasks with equal execution times. For the general case, we propose two-phase heuristic approaches: the first phase orders the tasks, and the second maps tasks to machines. The performance of different task ordering and task mapping approaches is evaluated through simulation using synthetic and real traces. Based on the simulation results, earliest-due-date ordering of tasks, combined with mapping the current task to the most reliable machine and dropping long tasks, performs better than the other combinations in general settings. We also observe that task repetition and replication further improve the performance of the heuristics.
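
The two-phase heuristic named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Machine`, `schedule_edd_most_reliable`, and `task_reliability` are hypothetical, the exponential reliability model exp(-failure_rate * exec_time) is an assumed failure model, and the drop rule is simplified to discarding any task that cannot meet its deadline on any machine, which may differ from the long-task-dropping rule used in the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    failure_rate: float          # assumed constant failure rate per time unit
    busy_until: float = 0.0      # time at which the machine becomes free
    assigned: list = field(default_factory=list)


def task_reliability(exec_time, failure_rate):
    # Assumed model: probability of no failure during the task's execution.
    return math.exp(-failure_rate * exec_time)


def schedule_edd_most_reliable(tasks, machines):
    """Phase 1: order tasks by earliest due date (deadline).
    Phase 2: map each task to the most reliable machine that can still
    finish it by its deadline; drop the task if no machine can.

    tasks: list of (name, exec_time, deadline) tuples.
    Returns a dict with the per-machine assignment and the dropped tasks.
    """
    dropped = []
    for name, exec_time, deadline in sorted(tasks, key=lambda t: t[2]):
        feasible = [m for m in machines
                    if m.busy_until + exec_time <= deadline]
        if not feasible:
            dropped.append(name)  # simplified drop rule (assumption)
            continue
        best = min(feasible, key=lambda m: m.failure_rate)
        best.busy_until += exec_time
        best.assigned.append(name)
    return {
        "assignment": {m.name: list(m.assigned) for m in machines},
        "dropped": dropped,
    }


if __name__ == "__main__":
    machines = [Machine("m1", 0.01), Machine("m2", 0.05)]
    tasks = [("a", 2, 2), ("b", 2, 3), ("c", 2, 4), ("d", 5, 4)]
    print(schedule_edd_most_reliable(tasks, machines))
```

In this toy instance, task `d` is too long to meet its deadline on either machine once the earlier tasks are placed, so it is dropped, while the remaining tasks prefer the machine with the lower failure rate whenever it is free in time.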
