App-Bisect: Autonomous Healing for Microservice-Based Apps

The microservice and DevOps approach to software design has resulted in new software features being delivered immediately to users, instead of waiting for long refresh cycles. On the downside, software bugs and performance regressions have become an important cause of downtime. We propose app-bisect, an autonomous tool to troubleshoot and repair such software issues in production environments. Our insight is that the evolution of microservices in an application can be captured as mutations to the graph of microservice dependencies, so that a past version of the graph can be deployed automatically as an interim measure until the problem is permanently fixed. We describe how canary testing and version-aware routing techniques can speed up the search for such a candidate version. We present the overall design and the key challenges in implementing such a system.
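The search for a candidate version can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the history of the dependency graph is available as an ordered list of snapshots, that a hypothetical `is_healthy` callback stands in for a canary test against one snapshot, and that every snapshot after the faulty change is also unhealthy, so that a plain binary search applies. The names `GraphVersion`, `latest_healthy_version`, and `is_healthy` are illustrative only.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a graph version is modelled as a mapping from microservice name
# to the version of that service deployed in the snapshot.
GraphVersion = Dict[str, str]


def latest_healthy_version(
    history: List[GraphVersion],
    is_healthy: Callable[[GraphVersion], bool],
) -> Optional[int]:
    """Return the index of the most recent healthy snapshot, or None.

    `history` is ordered oldest to newest. The search assumes that once a
    fault is introduced, all later snapshots are unhealthy as well.
    """
    lo, hi = 0, len(history) - 1
    best: Optional[int] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_healthy(history[mid]):
            best = mid       # healthy: remember it and look for a newer one
            lo = mid + 1
        else:
            hi = mid - 1     # unhealthy: the fault was introduced at or before mid
    return best


if __name__ == "__main__":
    # Hypothetical history: the fault arrives when "cart" moves to v3.
    history = [
        {"cart": "v1", "pay": "v1"},
        {"cart": "v1", "pay": "v2"},
        {"cart": "v2", "pay": "v2"},
        {"cart": "v3", "pay": "v2"},
        {"cart": "v3", "pay": "v3"},
    ]
    idx = latest_healthy_version(history, lambda g: g["cart"] != "v3")
    print(idx, history[idx] if idx is not None else None)
```

With n snapshots this needs on the order of log2(n) canary evaluations rather than n, which is the kind of speed-up a bisection-style search is meant to provide. In practice the monotonicity assumption may not hold, for example when several independent changes interact.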
