Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
Pu Zhao | Murali Chintalapati | Randolph Yao | Zheng Mu | Sebastien Levy | Youjiang Wu | Yingnong Dang | Peng Huang | Tarun Ramani | Naga Govindaraju | Xukun Li | Qingwei Lin | Gil Lapid Shafriri
[1] George Candea,et al. Microreboot - A Technique for Cheap Recovery , 2004, OSDI.
[2] Noah Treuhaft,et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies , 2002 .
[3] B. L. Welch. THE GENERALIZATION OF ‘STUDENT'S’ PROBLEM WHEN SEVERAL DIFFERENT POPULATION VARIANCES ARE INVOLVED , 1947 .
[4] Amer Diwan,et al. Meaningful Availability , 2020, NSDI.
[5] W. R. Thompson. ON THE LIKELIHOOD THAT ONE UNKNOWN PROBABILITY EXCEEDS ANOTHER IN VIEW OF THE EVIDENCE OF TWO SAMPLES , 1933 .
[6] Christof Fetzer,et al. Perfect Failure Detection in Timed Asynchronous Systems , 2003, IEEE Trans. Computers.
[7] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[8] Robbert van Renesse,et al. A Gossip-Style Failure Detection Service , 2009 .
[9] Ron Kohavi,et al. The Surprising Power of Online Experiments , 2017 .
[10] Yee Whye Teh,et al. Set Transformer , 2018, ICML.
[11] Weisong Shi,et al. Making Disk Failure Predictions SMARTer! , 2020, FAST.
[12] Peng Huang,et al. 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018 , 2018, OSDI.
[13] Murali Chintalapati,et al. Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure , 2020, NSDI.
[14] Marcos K. Aguilera,et al. On the quality of service of failure detectors , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.
[15] J. Friedman. Greedy function approximation: A gradient boosting machine. , 2001 .
[16] Sam Toueg,et al. Unreliable failure detectors for reliable distributed systems , 1996, JACM.
[17] Yingwei Luo,et al. Failure Recovery: When the Cure Is Worse Than the Disease , 2013, HotOS.
[18] Tie-Yan Liu,et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree , 2017, NIPS.
[19] Andreas Haeberlen,et al. PeerReview: practical accountability for distributed systems , 2007, SOSP.
[20] Dongmei Zhang,et al. Predicting Node failure in cloud service systems , 2018, ESEC/SIGSOFT FSE.
[21] Marcos K. Aguilera,et al. Detecting failures in distributed systems with the Falcon spy network , 2011, SOSP.
[22] Marcos K. Aguilera,et al. No Time for Asynchrony , 2009, HotOS.
[23] Scott Lundberg,et al. A Unified Approach to Interpreting Model Predictions , 2017, NIPS.
[24] Karan Gupta,et al. IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services , 2019, USENIX Annual Technical Conference.
[25] Peng Li,et al. Improving Service Availability of Cloud Systems by Predicting Disk Error , 2018, USENIX ATC.
[26] Andrew Warfield,et al. Live migration of virtual machines , 2005, NSDI.
[27] Xin Wu,et al. NetPilot: automating datacenter network failure mitigation , 2012, SIGCOMM '12.
[28] Marcos K. Aguilera,et al. Improving Availability in Distributed Systems with Failure Informers , 2013, NSDI.
[29] Scott F. Smith,et al. Understanding, Detecting and Localizing Partial Failures in Large System Software , 2020, NSDI.
[30] Greg Hamerly,et al. Bayesian approaches to failure prediction for disk drives , 2001, ICML.
[31] Peng Huang,et al. Gray Failure: The Achilles' Heel of Cloud-Scale Systems , 2017, HotOS.
[32] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.
[33] Shipra Agrawal,et al. Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.
[34] Ricardo Bianchini,et al. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms , 2017, SOSP.
[35] Robert B. Ross,et al. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems , 2018, FAST.
[36] Jeffrey C. Mogul,et al. Nines are Not Enough: Meaningful Metrics for Clouds , 2019, HotOS.
[37] Lantao Yu,et al. Dynamic Attention Deep Model for Article Recommendation by Learning Human Editors' Demonstration , 2017, KDD.