Root-Cause Metric Location for Microservice Systems via Log Anomaly Detection

Microservice systems are typically fragile, and failures in them are inevitable because of their complexity and large scale. However, localizing the root-cause metric of a failure is challenging because of the complicated dependencies in such systems and the huge number of diverse metrics. Existing methods rely on either the correlation among metrics or the correlation between metrics and failures. All of them ignore a key data source in microservice systems: logs. In this paper, we propose a novel root-cause metric localization approach that incorporates log anomaly detection. Our approach is based on a key observation: the value of the root-cause metric should change along with the log anomaly score of the system when a failure occurs. Specifically, our approach has two components: collecting anomaly scores with a log anomaly detection algorithm, and identifying the root-cause metric through robust correlation analysis with data augmentation. Experiments on an open-source benchmark microservice system demonstrate that our approach identifies root-cause metrics more accurately than existing methods and requires only a short localization time. Our approach can therefore save engineers considerable effort in diagnosing and mitigating failures quickly.
