Paper information - Metastable failures in distributed systems

Metastable failures in distributed systems

We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events: they are outliers because nothing in the past points to their possibility, they have a severe impact, and they are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different on the surface, deeper analysis shows that they can be understood within the same framework. We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad-hoc techniques developed post-factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.

Abutalib Aghayev | Aleksey Charapko | Timothy Zhu | Nathan Bronson

[1] Shan Lu, et al. ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems, 2019, FAST.

[2] John K. Ousterhout, et al. In Search of an Understandable Consensus Algorithm, 2014, USENIX ATC.

[3] Michael Nygard, et al. Release It!: Design and Deploy Production-Ready Software, 2017.

[4] Wonho Kim, et al. Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services, 2016, OSDI.

[5] Jim Gray, et al. The convoy phenomenon, 1979, OPSR.

[6] Niall Murphy, et al. Site Reliability Engineering: How Google Runs Production Systems, 2016.

[7] Tanakorn Leesatapornwongsa, et al. Limplock: understanding the impact of limpware on scale-out cloud systems, 2013, SoCC.

[8] Edward A. Ashcroft, et al. Proving Assertions about Parallel Programs, 1975, J. Comput. Syst. Sci.

[9] Leslie Lamport, et al. The part-time parliament, 1998, TOCS.

[10] Kang G. Shin, et al. Maestro: quality-of-service in large disk arrays, 2011, ICAC '11.

[11] Joe Armstrong, et al. Making reliable distributed systems in the presence of software errors, 2003.