Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments

Driven by the emerging business models (e.g., digital sales) and IT technologies (e.g., DevOps and Cloud computing), the architecture of software is shifting from monolithic to microservice rapidly. Benefit from microservice, software development, and delivery processes are accelerated significantly. However, along with many micro services running in the dynamic cloud environment with complex interactions, identifying and locating the abnormal services are extraordinarily difficult. This paper presents a novel system named “Microscope” to identify and locate the abnormal services with a ranked list of possible root causes in Micro-service environments. Without instrumenting the source code of micro services, Microscope can efficiently construct a service causal graph and infer the causes of performance problems in real time. Experimental evaluations in a micro-service benchmark environment show that Microscope achieves a good diagnosis result, i.e., 88% in precision and 80% in recall, which is higher than several state-of-the-art methods. Meanwhile, it has a good scalability to adapt to large-scale micro-service systems.

[1]  C. Granger Investigating Causal Relations by Econometric Models and Cross-Spectral Methods , 1969 .

[2]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[3]  Christof Fetzer,et al.  Sieve: actionable insights from monitored metrics in distributed systems , 2017, Middleware.

[4]  Hiranya Jayathilaka,et al.  Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications , 2017, WWW.

[5]  Erik Elmroth,et al.  Performance Anomaly Detection and Bottleneck Identification , 2015, ACM Comput. Surv..

[6]  Peter Bühlmann,et al.  Robustification of the PC-Algorithm for Directed Acyclic Graphs , 2008 .

[7]  Randy H. Katz,et al.  X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[8]  Jeffrey S. Chase,et al.  Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.

[9]  Chita R. Das,et al.  CloudPD: Problem determination and diagnosis in shared dynamic clouds , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[10]  W. Wong,et al.  Learning Causal Bayesian Network Structures From Experimental Data , 2008 .

[11]  Pengfei Chen,et al.  CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems , 2014, IEEE INFOCOM 2014 - IEEE Conference on Computer Communications.

[12]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[13]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[14]  Ping Wang,et al.  CloudRanger: Root Cause Identification for Cloud Native Systems , 2018, 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[15]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[16]  Thomas F. Wenisch,et al.  The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services , 2014, OSDI.

[17]  Armando Fox,et al.  Fingerprinting the datacenter: automated classification of performance crises , 2010, EuroSys '10.

[18]  Peter Bühlmann,et al.  Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm , 2007, J. Mach. Learn. Res..

[19]  Sam Shah,et al.  Root cause detection in a service-oriented architecture , 2013, SIGMETRICS '13.