Paper Information - Automating Chaos Experiments in Production

Automating Chaos Experiments in Production

Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle component failures and slowdowns. This paper describes the platform and our experiences operating it.
