Analysis of Frequently Failing Tasks and Rescheduling Strategy in the Cloud System

To better understand task failures in cloud computing systems, the authors analyze the failure frequency of tasks in the Google cluster dataset and find some frequently failing tasks that suffer from long-term failures and repeated rescheduling. They call these killer tasks, since they can be a major concern for cloud systems. Hence, there is a need to analyze killer tasks thoroughly and recognize them precisely. In this article, the authors first investigate the resource usage patterns of killer tasks and analyze how killer tasks are rescheduled in the Google cluster, finding that repeated rescheduling wastes a large amount of resources. Based on these observations, they then propose an online killer task recognition service that recognizes killer tasks at a very early stage of their occurrence so as to avoid unnecessary resource waste. The experimental results show that the proposed service achieves 93.6% accuracy in recognizing killer tasks, with an 87% timing advance and 86.6% resource savings for the cloud system on average.
