A study of the performance of general compressors on log files

Large-scale software systems and cloud services continue to produce large amounts of log data. Such log data is usually preserved for a long time (e.g., for auditing purposes). General compressors, like the LZ77 compressor used in gzip, are commonly used in practice to compress log data and reduce the cost of long-term storage. However, such general compressors do not consider the unique nature of log data. In this paper, we study the performance of general compressors on log data relative to their performance on natural language data. We used 12 widely used general compressors to compress nine log files, which we collected by surveying prior literature on text compression, log compression and log analysis. We observe that log data is more repetitive than natural language data, and that log data can be compressed and decompressed faster and with higher compression ratios. In addition, the compressor with the highest compression ratio for natural language data is rarely the one with the highest compression ratio for log data. Moreover, the compressors with the highest compression ratio for log data are rarely adopted in practice by current logging libraries and log management tools. We also observe that the peak compression and decompression speeds of general compressors on log data are often achieved with a small data size, a size that log management tools may not use. Finally, we observe that the optimal compression performance on log data (measured by a combined compression performance score) usually requires the compression level to be configured higher than the default level. Our findings call for careful consideration when choosing general compressors and their associated compression levels for log data in practice. In addition, our findings shed light on opportunities for future research on compressors that better suit the characteristics of log data.
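As a rough illustration of the kind of measurement described above, the following sketch compares compression ratio and speed across a few general compressors and compression levels. It is not the study's benchmark: the three compressors are simply those available in the Python standard library (not the 12 used in the paper), the synthetic log format and the helper names `make_sample_log` and `measure` are hypothetical, and the combined performance score from the paper is not reproduced.

```python
import bz2
import lzma
import time
import zlib


def make_sample_log(n_lines=2000):
    """Build a synthetic, repetitive log (hypothetical format, illustration only)."""
    lines = [
        f"2024-01-01 12:{i // 60 % 60:02d}:{i % 60:02d} INFO worker-{i % 8} "
        f"handled request id={i} status=200"
        for i in range(n_lines)
    ]
    return "\n".join(lines).encode("utf-8")


# Each entry maps a name to (compress(data, level), decompress(data)).
COMPRESSORS = {
    "zlib": (lambda data, level: zlib.compress(data, level), zlib.decompress),
    "bz2": (lambda data, level: bz2.compress(data, level), bz2.decompress),
    "lzma": (lambda data, level: lzma.compress(data, preset=level), lzma.decompress),
}


def measure(data, name, level):
    """Return compression ratio and speeds (MB/s) for one compressor and level."""
    compress, decompress = COMPRESSORS[name]

    start = time.perf_counter()
    packed = compress(data, level)
    compress_time = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_time = max(time.perf_counter() - start, 1e-9)

    if restored != data:
        raise ValueError(f"{name} did not round-trip the input")

    size_mb = len(data) / 1e6
    return {
        "compressor": name,
        "level": level,
        "ratio": len(data) / len(packed),  # original size / compressed size
        "compress_mb_s": size_mb / compress_time,
        "decompress_mb_s": size_mb / decompress_time,
    }


if __name__ == "__main__":
    log_bytes = make_sample_log()
    for name in COMPRESSORS:
        for level in (1, 6, 9):  # levels valid for all three modules
            result = measure(log_bytes, name, level)
            print(
                f"{result['compressor']:>4} level={result['level']} "
                f"ratio={result['ratio']:.1f} "
                f"comp={result['compress_mb_s']:.1f} MB/s "
                f"decomp={result['decompress_mb_s']:.1f} MB/s"
            )
```

Running the same loop over real log files of different sizes, rather than the synthetic sample, would be the natural way to check how compressor choice, level and input size interact for a particular deployment.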
