A Semantic-aware Representation Framework for Online Log Analysis

Logs are one of the most valuable data sources for large-scale service management. Log representation, which converts unstructured text into structured vectors or matrices, is the first step towards automated log analysis. However, current log representation methods neither capture the domain-specific semantic information of logs nor handle the out-of-vocabulary (OOV) words that appear in new types of logs at runtime. We propose Log2Vec, a semantic-aware representation framework for log analysis. Log2Vec combines a log-specific word embedding method, which accurately extracts the semantic information of logs, with an OOV word processor that embeds OOV words into vectors at runtime. We analyze the impact of OOV words and evaluate the performance of the OOV word processor. Experiments on four public production log datasets demonstrate that Log2Vec not only resolves the issue posed by OOV words, but also significantly improves performance on two popular log-based service management tasks: log classification and anomaly detection. We have packaged Log2Vec into an open-source toolkit and hope that it can be used for future research.
