The Use of Template Miners and Encryption in Log Message Compression

Today, almost every software system produces large numbers of log messages that record events and activities during its use. These logs contain valuable runtime information that supports a variety of applications, such as anomaly detection, error prediction, and template mining. The generated log messages are usually raw, that is, unstructured, so they have to be parsed before data mining models can be applied. After parsing, template miners can be applied to the data to retrieve the events that occur in the log file. Each event consists of two parts: the template, which is fixed and identical for all instances of the same event type, and the parameter part, which varies between instances. To decrease the size of the log messages, we use the mined templates to build a dictionary of events and store only the dictionary, the event IDs, and the parameter lists. We use six template miners to acquire the templates, namely IPLoM, LenMa, LogMine, Spell, Drain, and MoLFI. In this paper, we evaluate the compression capacity of our dictionary method in combination with these algorithms. Since parameters may contain sensitive information, we also encrypt the files after compression and measure the resulting change in file size. We also examine the speed of the template miners. In our experiments, LenMa achieves the best compression rate, 67.4% on average. Because of its high runtime, however, we suggest combining our dictionary method with IPLoM and FFX: this combination is the fastest of all the methods and still reaches a 57.7% compression rate.
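
The dictionary idea described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the `<*>` wildcard notation, the function names, the `-1` fallback ID for unmatched lines, and the sample templates are all assumptions made for illustration, and the templates are supplied by hand rather than mined by IPLoM, LenMa, or any of the other algorithms.

```python
import re

WILDCARD = "<*>"  # assumed placeholder marking the parameter part of a template
RAW = -1          # assumed event ID for messages that match no template


def template_to_regex(template):
    """Escape the fixed text and turn every wildcard into a capture group."""
    parts = [re.escape(p) for p in template.split(WILDCARD)]
    return re.compile("(.*?)".join(parts))


def compress(messages, templates):
    """Return (dictionary, records); each record is (event_id, parameters)."""
    dictionary = dict(enumerate(templates))
    patterns = {eid: template_to_regex(t) for eid, t in dictionary.items()}
    records = []
    for msg in messages:
        for eid, pattern in patterns.items():
            m = pattern.fullmatch(msg)
            if m:
                records.append((eid, list(m.groups())))
                break
        else:
            # No template matched: keep the raw line so nothing is lost.
            records.append((RAW, [msg]))
    return dictionary, records


def decompress(dictionary, records):
    """Rebuild the original messages from the dictionary and the records."""
    messages = []
    for eid, params in records:
        if eid == RAW:
            messages.append(params[0])
            continue
        pieces = dictionary[eid].split(WILDCARD)
        # Interleave the fixed pieces of the template with the parameters.
        msg = pieces[0]
        for param, piece in zip(params, pieces[1:]):
            msg += param + piece
        messages.append(msg)
    return messages


templates = ["Connection from <*> closed", "User <*> logged in from <*>"]
logs = [
    "Connection from 10.0.0.1 closed",
    "User alice logged in from 10.0.0.2",
    "unparsed line",
]
dictionary, records = compress(logs, templates)
restored = decompress(dictionary, records)
```

Only `dictionary` and `records` would need to be stored. The fixed text of each template is kept once instead of once per message, which is where the size reduction comes from, and the parameter lists are the part that would be encrypted when they hold sensitive values.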
