Predicting Node failure in cloud service systems

In recent years, many traditional software systems have migrated to cloud computing platforms and are provided as online services. The service quality matters because system failures could seriously affect business and user experience. A cloud service system typically contains a large number of computing nodes. In reality, nodes may fail and affect service availability. In this paper, we propose a failure prediction technique, which can predict the failure-proneness of a node in a cloud service system based on historical data, before node failure actually happens. The ability to predict faulty nodes enables the allocation and migration of virtual machines to the healthy nodes, therefore improving service availability. Predicting node failure in cloud service systems is challenging, because a node failure could be caused by a variety of reasons and reflected by many temporal and spatial signals. Furthermore, the failure data is highly imbalanced. To tackle these challenges, we propose MING, a novel technique that combines: 1) a LSTM model to incorporate the temporal data, 2) a Random Forest model to incorporate spatial data; 3) a ranking model that embeds the intermediate results of the two models as feature inputs and ranks the nodes by their failure-proneness, 4) a cost-sensitive function to identify the optimal threshold for selecting the faulty nodes. We evaluate our approach using real-world data collected from a cloud service system. The results confirm the effectiveness of the proposed approach. We have also successfully applied the proposed approach in real industrial practice.

[1]  Audris Mockus,et al.  Towards building a universal defect prediction model with rank transformed predictors , 2016, Empirical Software Engineering.

[2]  Gerardo Canfora,et al.  Defect prediction as a multiobjective optimization problem , 2015, Softw. Test. Verification Reliab..

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Rongxin Wu,et al.  Dealing with noise in defect prediction , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[5]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2015, IEEE Transactions on Software Engineering.

[6]  Kashi Venkatesh Vishwanath,et al.  Characterizing cloud computing hardware reliability , 2010, SoCC '10.

[7]  Peng Li,et al.  Improving Service Availability of Cloud Systems by Predicting Disk Error , 2018, USENIX ATC.

[8]  Rami Bahsoon,et al.  Chapter 1 – Relating System Quality and Software Architecture: Foundations and Approaches , 2014 .

[9]  Ken-ichi Matsumoto,et al.  Comments on “Researcher Bias: The Use of Machine Learning in Software Defect Prediction” , 2016, IEEE Transactions on Software Engineering.

[10]  Oshry Ben-Harush,et al.  Predicting HDD failures from compound SMART attributes , 2017, SYSTOR.

[11]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[12]  Sashank Dara,et al.  Online Defect Prediction for Imbalanced Data , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[13]  Xin Yao,et al.  A Learning-to-Rank Approach to Software Defect Prediction , 2015, IEEE Transactions on Reliability.

[14]  Wenzhi Chen,et al.  Smart Ring: A Model of Node Failure Detection in High Available Cloud Data Center , 2012, NPC.

[15]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[16]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.

[17]  Georgios Gousios,et al.  Alitheia Core: An extensible software quality monitoring platform , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[18]  Norman E. Fenton,et al.  Quantitative Analysis of Faults and Failures in a Complex Software System , 2000, IEEE Trans. Software Eng..

[19]  T. S. Eugene Ng,et al.  Understanding the effects and implications of compute node related failures in hadoop , 2012, HPDC '12.

[20]  Song Wang,et al.  Automatically Learning Semantic Features for Defect Prediction , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[21]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[22]  Van-Anh Truong,et al.  Availability in Globally Distributed Storage Systems , 2010, OSDI.

[23]  Xinli Yang,et al.  Deep Learning for Just-in-Time Defect Prediction , 2015, 2015 IEEE International Conference on Software Quality, Reliability and Security.

[24]  Liam O'Brien,et al.  Evaluation of Commercial Cloud Services : A Systematic Literature Review , 2018 .

[25]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[26]  Shane McIntosh,et al.  Automated Parameter Optimization of Classification Techniques for Defect Prediction Models , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[27]  Jasmina Bogojeska,et al.  Predicting Disk Replacement towards Reliable Data Centers , 2016, KDD.

[28]  Baowen Xu,et al.  Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning , 2015, ESEC/SIGSOFT FSE.

[29]  Peng Huang,et al.  Gray Failure: The Achilles' Heel of Cloud-Scale Systems , 2017, HotOS.

[30]  Haoxiang Lin,et al.  An Empirical Study on Quality Issues of Production Big Data Platform , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[31]  Tie-Yan Liu Learning to Rank for Information Retrieval , 2009, Found. Trends Inf. Retr..

[32]  Scott Tilley,et al.  Software Testing in the Cloud: Perspectives on an Emerging Discipline , 2012 .

[33]  Lars Grunske,et al.  A comparison of machine learning algorithms for proactive hard disk drive failure detection , 2013, ISARCS '13.

[34]  Rami Bahsoon,et al.  Relating System Quality and Software Architecture , 2014, Relating System Quality and Software Architecture.

[35]  Navendu Jain,et al.  Understanding network failures in data centers: measurement, analysis, and implications , 2011, SIGCOMM.

[36]  Michele Lanza,et al.  Evaluating defect prediction approaches: a benchmark and an extensive comparison , 2011, Empirical Software Engineering.

[37]  Xavier Franch,et al.  A survey on quality attributes in service-based systems , 2015, Software Quality Journal.

[38]  Hongyu Zhang On the Distribution of Software Faults , 2008, IEEE Transactions on Software Engineering.

[39]  Xiuzhen Zhang,et al.  Comments on "Data Mining Static Code Attributes to Learn Defect Predictors" , 2007, IEEE Trans. Software Eng..

[40]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[41]  Hongyu Zhang,et al.  An investigation of the relationships between lines of code and defects , 2009, 2009 IEEE International Conference on Software Maintenance.

[42]  Ahmed E. Hassan,et al.  Replicating and Re-Evaluating the Theory of Relative Defect-Proneness , 2015, IEEE Transactions on Software Engineering.

[43]  Rami Bahsoon,et al.  Relating System Quality and Software Architecture , 2014, Relating System Quality and Software Architecture.

[44]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[45]  Andrew Warfield,et al.  Live migration of virtual machines , 2005, NSDI.

[46]  Christopher J. C. Burges,et al.  From RankNet to LambdaRank to LambdaMART: An Overview , 2010 .