Gremlin: Systematic Resilience Testing of Microservices

Modern Internet applications are being disaggregated into microservice-based architectures, with services being updated and deployed hundreds of times a day. The accelerated software life cycle and the heterogeneity of language runtimes within a single application necessitate a new approach for testing the resilience of these applications in production infrastructures. We present Gremlin, a framework for systematically testing the failure-handling capabilities of microservices. Gremlin is based on the observation that microservices are loosely coupled and thus rely on standard message-exchange patterns over the network. Gremlin lets the operator design tests easily and executes them by manipulating inter-service messages at the network layer. We show how to use Gremlin to express common failure scenarios, and how developers of an enterprise application were able to discover previously unknown bugs in their failure-handling code without modifying the application.
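
To make the idea of manipulating inter-service messages concrete, the following is a minimal Python sketch of rule-based fault injection between services. It is not Gremlin's actual API: the names `Message`, `Rule`, `Outcome`, and `apply_rules`, the two actions `abort` and `delay`, and the default 503 error code are all assumptions made for illustration. The sketch only decides what would happen to a message; a real tool would apply that decision in a network proxy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    """A hypothetical inter-service message seen at the network layer."""
    source: str
    destination: str
    body: str


@dataclass(frozen=True)
class Rule:
    """A hypothetical fault-injection rule for one (source, destination) pair."""
    source: str
    destination: str
    action: str            # assumed actions: "abort" or "delay"
    error_code: int = 503  # returned to the caller when action is "abort"
    delay_ms: int = 0      # added latency when action is "delay"


@dataclass(frozen=True)
class Outcome:
    """What happens to a message after all matching rules are applied."""
    delivered: bool
    delay_ms: int = 0
    error_code: Optional[int] = None


def apply_rules(msg: Message, rules: List[Rule]) -> Outcome:
    """Apply rules in order: delays accumulate, the first abort stops delivery."""
    delay = 0
    for rule in rules:
        # Skip rules that target a different pair of services.
        if (rule.source, rule.destination) != (msg.source, msg.destination):
            continue
        if rule.action == "abort":
            return Outcome(delivered=False, delay_ms=delay,
                           error_code=rule.error_code)
        if rule.action == "delay":
            delay += rule.delay_ms
    return Outcome(delivered=True, delay_ms=delay)


# Example scenario (service names are made up): slow database, failed auth.
rules = [
    Rule("web", "db", "delay", delay_ms=100),
    Rule("web", "auth", "abort"),
]
slow = apply_rules(Message("web", "db", "SELECT 1"), rules)
failed = apply_rules(Message("web", "auth", "login"), rules)
```

Because the rules refer only to service names and message flow, a test written this way does not depend on the language runtime of any service, which is the property the abstract attributes to working at the network layer.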
