Linnaeus: A highly reusable and adaptable ML based log classification pipeline

Logs are a common way to record detailed run-time information in software. As modern software systems grow in scale and complexity, logs have become indispensable for understanding the internal state of a system. At the same time, however, manually inspecting logs has become impractical. Recently, there has been more emphasis on statistical and machine learning (ML) based methods for analyzing logs. While the results have shown promise, most of the literature focuses on algorithms and state-of-the-art (SOTA) results, while largely ignoring the practical aspects. In this paper, we demonstrate our end-to-end log classification pipeline, Linnaeus. Besides showing the more traditional ML flow, we also demonstrate our solutions for adaptability and reuse, for integration into large-scale software development processes, and for coping with the lack of labelled data. We hope Linnaeus can serve as a blueprint for, and inspire the integration of, various ML-based solutions in other large-scale industrial settings.
