Emergent Failures: Rethinking Cloud Reliability at Scale

Since the inception of cloud computing, highly reliable service has been of the utmost importance to the business objectives of providers and their customers. This holds for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. The cloud-computing and dependability research communities have therefore devoted considerable effort to hardening system components against software and hardware failures. However, as these systems have grown in scale, heterogeneity, and complexity, they have begun to exhibit emergent behavior, and their failures have become emergent as well. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to handle. These strategies can even be responsible for such failures. These emergent failures, frequently transient and identifiable only at runtime, represent a significant threat to the design of reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions for further study on Internet of Things integration and holistic fault tolerance.
