Emergent Failures: Rethinking Cloud Reliability at Scale
Rajkumar Buyya | Rajiv Ranjan | Zhenyu Wen | Jie Xu | Peter Garraghan | Alexander Romanovsky | Renyu Yang