Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform

Many software services today are hosted on cloud computing platforms, such as Amazon EC2, because of benefits such as reduced operational costs. However, node failures in these platforms can impact the availability of the hosted services and potentially lead to large financial losses. Predicting node failures before they occur is crucial, as it enables DevOps engineers to minimize their impact by taking preventive actions. Such predictions are hard, though, because of challenges such as the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. The successful adoption of AIOps solutions, however, requires much more than a top-performing machine learning model: the solutions must also be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures in an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.
