Efficiently compressing string columnar data using frequent pattern mining

In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Many string columnar data cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for them is the LZ77-based compression algorithms such as GZIP [16] or Snappy [13]. These algorithms treat data as a byte block and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm using frequent string patterns directly mined from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retains compression and decompression speed.

[1]  Hugh E. Williams,et al.  Compressing Integers for Fast File Access , 1999, Comput. J..

[2]  Hugh E. Williams,et al.  A general-purpose compression scheme for large collections , 2002, TOIS.

[3]  Jonathan Goldstein,et al.  Compressing relations and indexes , 1998, Proceedings 14th International Conference on Data Engineering.

[4]  Toon Calders,et al.  Mining Compressing Sequential Patterns , 2012, Stat. Anal. Data Min..

[5]  Ulf Leser,et al.  Trends in Genome Compression , 2014 .

[6]  Justin Zobel,et al.  Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval , 2010, SPIRE.

[7]  Szymon Grabowski,et al.  Robust relative compression of genomes with random access , 2011, Bioinform..

[8]  Ian H. Witten,et al.  Modeling for text compression , 1989, CSUR.

[9]  Zhiwei Xu,et al.  RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[10]  Norman May,et al.  The SAP HANA Database -- An Architecture Overview , 2012, IEEE Data Eng. Bull..

[11]  A. Moffat,et al.  Offline dictionary-based compression , 2000, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[12]  James A. Storer,et al.  Data compression via textual substitution , 1982, JACM.

[13]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[14]  Ian H. Witten,et al.  Identifying Hierarchical Structure in Sequences: A linear-time algorithm , 1997, J. Artif. Intell. Res..

[15]  Fernando Pereira,et al.  Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia , 2012 .

[16]  Peter Deutsch,et al.  GZIP file format specification version 4.3 , 1996, RFC.

[17]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[18]  Per-Åke Larson,et al.  Columnar Storage in SQL Server 2012 , 2012, IEEE Data Eng. Bull..

[19]  Carsten Binnig,et al.  Dictionary-based order-preserving string compression for main memory column stores , 2009, SIGMOD Conference.

[20]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[21]  Marcin Zukowski,et al.  MonetDB/X100: Hyper-Pipelining Query Execution , 2005, CIDR.

[22]  Glen G. Langdon,et al.  Universal modeling and coding , 1981, IEEE Trans. Inf. Theory.

[23]  Leonid Boytsov,et al.  Decoding billions of integers per second through vectorization , 2012, Softw. Pract. Exp..

[24]  Gennady Antoshenkov,et al.  Dictionary-based order-preserving string compression , 1997, The VLDB Journal.

[25]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[26]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[27]  Marcin Zukowski,et al.  Super-Scalar RAM-CPU Cache Compression , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[28]  David A. Huffman,et al.  A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.

[29]  David J. DeWitt,et al.  Materialization Strategies in a Column-Oriented DBMS , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30]  Sara P. Garcia,et al.  GReEn: a tool for efficient compression of genome resequencing data , 2011, Nucleic acids research.

[31]  Viktor Leis,et al.  The adaptive radix tree: ARTful indexing for main-memory databases , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[32]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[33]  Donald R. Morrison,et al.  PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric , 1968, J. ACM.

[34]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[35]  Peter Deutsch,et al.  DEFLATE Compressed Data Format Specification version 1.3 , 1996, RFC.

[36]  Hugh E. Williams,et al.  General-purpose compression for efficient retrieval , 2001, J. Assoc. Inf. Sci. Technol..

[37]  Jilles Vreeken,et al.  The long and the short of it: summarising event sequences with serial episodes , 2012, KDD.

[38]  Hugh E. Williams,et al.  Burst tries: a fast, efficient data structure for string keys , 2002, TOIS.

[39]  David J. DeWitt,et al.  How to barter bits for chronons: compression and bandwidth trade offs for database scans , 2007, SIGMOD '07.

[40]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[41]  Justin Zobel,et al.  Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections , 2011, Proc. VLDB Endow..

[42]  Ulf Leser,et al.  FRESCO: Referential Compression of Highly Similar Sequences , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[43]  Martin L. Kersten,et al.  MonetDB: Two Decades of Research in Column-oriented Database Architectures , 2012, IEEE Data Eng. Bull..

[44]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[45]  Ranjan Sinha,et al.  HAT-Trie: A Cache-Conscious Trie-Based Data Structure For Strings , 2007, ACSC.

[46]  Jae-Gil Lee,et al.  Business Analytics in (a) Blink , 2012, IEEE Data Eng. Bull..

[47]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[48]  Jianyong Wang,et al.  Efficient mining of frequent sequence generators , 2008, WWW.