Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

This technical report is an extended version of our OSDI 2020 paper, "Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions".

Abstract: When a failure occurs in a production system, the highest priority is to mitigate it quickly. Despite its importance, failure mitigation is done in a reactive and ad hoc way: fixed actions are taken only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose Narya, a preventive and adaptive failure mitigation service that is integrated into a production cloud, Microsoft Azure’s compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides on smart mitigation actions, with the goal of averting VM failures. Narya’s decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances its adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. On average, it reduces VM interruptions by 26% compared to the previous static strategy.
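To make the idea of continually exploring the best mitigation action concrete, the following is a minimal sketch of one common explore/exploit technique, Beta-Bernoulli Thompson sampling. It is an illustration only and is not claimed to be Narya's actual decision engine. The action names, the binary "interruption averted" reward, and the simulated success rates are all hypothetical assumptions made for this example.

```python
import random


class ThompsonActionSelector:
    """Beta-Bernoulli Thompson sampling over a fixed set of actions.

    Hypothetical illustration: each action's chance of averting a VM
    interruption is modeled as an unknown Bernoulli parameter.
    """

    def __init__(self, actions, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior (uniform) for every action.
        self.successes = {a: 1 for a in actions}
        self.failures = {a: 1 for a in actions}

    def choose(self):
        # Draw one sample per action from its posterior, pick the largest.
        samples = {
            a: self.rng.betavariate(self.successes[a], self.failures[a])
            for a in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, action, averted):
        # averted=True means no VM interruption was observed afterwards.
        if averted:
            self.successes[action] += 1
        else:
            self.failures[action] += 1

    def posterior_mean(self, action):
        s, f = self.successes[action], self.failures[action]
        return s / (s + f)


def simulate(true_rates, rounds=2000, seed=0):
    """Run the selector against made-up per-action success rates."""
    selector = ThompsonActionSelector(list(true_rates), seed=seed)
    env = random.Random(seed + 1)  # separate RNG for the simulated outcomes
    counts = {a: 0 for a in true_rates}
    for _ in range(rounds):
        action = selector.choose()
        averted = env.random() < true_rates[action]
        selector.update(action, averted)
        counts[action] += 1
    return selector, counts


if __name__ == "__main__":
    # Hypothetical action names and success rates, for illustration only.
    rates = {"action_a": 0.9, "action_b": 0.5, "action_c": 0.2}
    selector, counts = simulate(rates)
    for name in rates:
        print(name, counts[name], round(selector.posterior_mean(name), 3))
```

In this sketch, actions that avert interruptions more often are chosen more frequently over time, while the weaker ones are still tried occasionally, so the selector can adapt if their effectiveness changes.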
