Tritium: A Cross-layer Analytics System for Enhancing Microservice Rollouts in the Cloud

Microservice architectures are widely used in cloud-native applications because their modularity allows components to be developed and deployed independently. Because of the many complex interactions between components, however, it is difficult to determine the effects of a particular microservice rollout. Site Reliability Engineers must be able to determine with confidence whether a new rollout is at fault for a concurrent or subsequent performance problem so that they can mitigate the issue quickly. We present Tritium, a cross-layer analytics system that synthesizes several types of data to suggest possible causes of Service Level Objective (SLO) violations in microservice applications. It uses event data to identify new version rollouts, tracing data to build a topology graph of the cluster and determine which services a rollout may affect, and causal impact analysis applied to metric time series to determine whether the rollout is at fault. Tritium rests on the principle that if a rollout is not responsible for a change in an upstream or neighboring SLO metric, then the rollout's telemetry data will predict the behavior of that SLO metric poorly. In this paper, we experimentally demonstrate that Tritium can accurately attribute SLO violations to downstream rollouts, and we outline the steps necessary to fully realize Tritium.
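The pipeline described above can be illustrated with a minimal sketch. It is not Tritium's implementation: the function names, the sample call edges, and the sample metrics are all hypothetical, and a plain least-squares R² stands in for the causal impact analysis, which is a considerable simplification. The sketch only shows the two ideas in miniature: walking a trace-derived call graph to find services upstream of a rollout, and scoring how well the rollout's telemetry predicts an SLO metric.

```python
from collections import defaultdict, deque


def upstream_services(call_edges, rolled_out):
    """Return every service that transitively calls `rolled_out`.

    `call_edges` is an iterable of (caller, callee) pairs, assumed here
    to have been extracted from tracing data.
    """
    callers = defaultdict(set)
    for caller, callee in call_edges:
        callers[callee].add(caller)

    seen = set()
    queue = deque([rolled_out])
    while queue:
        service = queue.popleft()
        for caller in callers.get(service, ()):
            if caller not in seen and caller != rolled_out:
                seen.add(caller)
                queue.append(caller)
    return seen


def r_squared(x, y):
    """R^2 of a one-variable least-squares fit of y on x.

    Used as a simplified stand-in for causal impact analysis: a low
    score means x (rollout telemetry) predicts y (SLO metric) poorly.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no predictive signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical topology: frontend -> cart -> db, frontend -> search.
edges = [("frontend", "cart"), ("cart", "db"), ("frontend", "search")]
candidates = upstream_services(edges, "db")

# Hypothetical post-rollout series: telemetry of the rolled-out service
# and an SLO metric of one upstream service.
db_cpu = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
frontend_latency = [10.0, 20.0, 31.0, 39.0, 50.0, 61.0]
score = r_squared(db_cpu, frontend_latency)
```

In this toy setup, a score near 1 would mark the rollout as a plausible cause of the upstream latency change, while a score near 0 would argue against it. A real system would need a pre-rollout baseline and a proper time-series model rather than a single correlation.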
