High density compression of log files

There is a growing demand among Internet and network-related services to collect valuable service usage data and process it with data mining methods. This paper presents a generalized scheme for preprocessing and high-density compression of log files. The aim of the method is to provide a basis for long-term storage in a form suitable for direct processing by data mining algorithms. Experiments on real log data show that the differentiated semantic log compression (dslc) methods reach compression rates of 2-3%, outperforming general-purpose compression utilities. The paper also demonstrates the flexibility of the pipeline concept by incorporating a field-wise compression algorithm to improve compression efficiency. The implementation of this scheme was designed for the largest Hungarian Internet content provider.
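
To make the idea of field-wise compression concrete, the following is a minimal sketch and not the dslc implementation described in the paper. It assumes space-separated log lines with a fixed number of fields and uses zlib as a stand-in compressor for each field column; the function names `compress_fieldwise` and `decompress_fieldwise` and the sample log lines are hypothetical.

```python
import zlib


def compress_fieldwise(lines, sep=" "):
    """Compress each field (column) of the log separately.

    Assumption: every line has the same number of sep-separated fields.
    """
    if not lines:
        return []
    rows = [line.split(sep) for line in lines]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all lines must have the same number of fields")
    # Group values of the same field together so that similar data
    # (all IP addresses, all status codes, ...) is compressed as one stream.
    columns = ["\n".join(row[i] for row in rows) for i in range(width)]
    return [zlib.compress(col.encode("utf-8"), 9) for col in columns]


def decompress_fieldwise(blobs, sep=" "):
    """Invert compress_fieldwise and rebuild the original lines."""
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return [sep.join(fields) for fields in zip(*columns)]


# Hypothetical sample data, made up for illustration only.
sample_lines = [
    "10.0.0.1 2004-05-01T12:00:01 GET /index.html 200",
    "10.0.0.2 2004-05-01T12:00:02 GET /news.html 200",
    "10.0.0.1 2004-05-01T12:00:03 GET /index.html 404",
]
blobs = compress_fieldwise(sample_lines)
restored = decompress_fieldwise(blobs)
```

The sketch only illustrates the column-wise grouping step. Whether it beats compressing the whole file at once depends on the data, and it makes no claim about the compression rates reported in the paper.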