Hunting Killer Tasks for Cloud System through Behavior Pattern Learning

Motivated by frequent failures in cloud computing systems, we analyze the failure frequency and continuity of tasks from the Google cloud cluster and identify what we call killer tasks: tasks that suffer from frequent failures and repeated rescheduling. Killer tasks can be a major concern in cloud systems because they waste resources and significantly increase the scheduling workload. In this paper, we investigate the characteristics and behavior patterns of killer tasks, then develop an approach that recognizes killer tasks at a very early stage of their occurrence so that they can be addressed proactively instead of being rescheduled repeatedly. The empirical results show that our approach achieves 97% precision in recognizing killer tasks, with a maximum lead time of 1,164 minutes and an average resource saving of 89% for the cloud system.
