Predicting and Mitigating Job Failures in Big Data Clusters

In large-scale data centers, software and hardware failures are frequent, leading to job execution failures that can cause significant resource waste and performance degradation. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, prevailing research on datacenter workload characterization has so far overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce resource waste in big data centers. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with high task fan-out. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an online predictive model for job failures by applying several statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The objective of postponing job terminations is to strike a good tradeoff between resource waste and false predictions of successful jobs. Our evaluation results show that the proposed method reduces resource waste significantly, by 41.9% on average, while keeping false terminations of jobs low, i.e., at only 1%.
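To make the two mechanisms described above concrete, the following Python sketch trains the three classifiers named in the abstract on synthetic job features and wraps the delay-based policy in a small decision function. This is a minimal illustration, not the paper's implementation: the feature set, the 300-second grace period, and the 0.8 failure-probability threshold are all illustrative assumptions.

```python
# Sketch of (1) job-failure prediction with LDA, QDA, and logistic regression,
# and (2) a delay-based mitigation policy that terminates a predicted-failing
# job only after a grace period. All features, labels, and thresholds here are
# synthetic stand-ins, not values from the paper.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-job features, e.g. CPU request, memory request, task count.
X = rng.random((1000, 3))
# Label 1 = job eventually failed, 0 = job finished successfully (synthetic).
y = (X[:, 0] + 0.5 * X[:, 2] + 0.2 * rng.standard_normal(1000) > 0.9).astype(int)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", model.score(X, y))


def should_terminate(model, job_features, elapsed_seconds,
                     grace_period=300.0, threshold=0.8):
    """Delay-based mitigation: kill a job only once it has run past an
    (assumed) grace period AND the model is confident it will fail."""
    if elapsed_seconds < grace_period:
        return False  # give the job a chance; avoids killing healthy jobs early
    p_fail = model.predict_proba(job_features.reshape(1, -1))[0, 1]
    return p_fail >= threshold


# Example: query the LR model for a 10-minute-old job with given features.
job = np.array([0.7, 0.2, 0.9])
print(should_terminate(models["LR"], job, elapsed_seconds=600.0))
```

The grace period embodies the tradeoff the abstract highlights: waiting longer wastes more resources on jobs that are doomed to fail, but it also reduces the chance of falsely terminating jobs that would have succeeded.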
