Paper information - Efficiently compressing string columnar data using frequent pattern mining

In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Much string columnar data cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for such data is an LZ77-based compression algorithm such as GZIP [16] or Snappy [13]. These algorithms treat data as a byte block and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm using frequent string patterns directly mined from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
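The general idea of mining frequent patterns from a sample and using them as dictionary phrases can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the substring-length bounds, the support threshold, the "bytes saved" ranking, and the token representation are all assumptions made for illustration, and the pruning and indexing steps mentioned in the abstract are not modeled.

```python
from collections import Counter


def mine_frequent_substrings(sample, min_len=2, max_len=8,
                             min_support=2, max_patterns=255):
    """Return substrings that occur in at least `min_support` sample values.

    Hypothetical helper: support is counted once per value, and patterns
    are ranked by a rough estimate of the characters they would save.
    """
    support = Counter()
    for value in sample:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        support.update(seen)  # each value contributes at most 1 per pattern
    frequent = [(p, s) for p, s in support.items() if s >= min_support]
    # Rough saving estimate: (length - 1) characters per supporting value.
    frequent.sort(key=lambda ps: (-(len(ps[0]) - 1) * ps[1], ps[0]))
    return [p for p, _ in frequent[:max_patterns]]


def compress(value, patterns):
    """Greedy longest-match encoding into a token list.

    A token is ("P", index) for a dictionary phrase or ("L", char) for a
    literal character that no phrase covers.
    """
    by_length = sorted(range(len(patterns)), key=lambda i: -len(patterns[i]))
    tokens = []
    pos = 0
    while pos < len(value):
        for i in by_length:
            if value.startswith(patterns[i], pos):
                tokens.append(("P", i))
                pos += len(patterns[i])
                break
        else:  # no phrase matched at this position
            tokens.append(("L", value[pos]))
            pos += 1
    return tokens


def decompress(tokens, patterns):
    """Invert `compress` by expanding phrase indexes back into text."""
    return "".join(patterns[x] if kind == "P" else x for kind, x in tokens)


if __name__ == "__main__":
    column_sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/c",
    ]
    phrases = mine_frequent_substrings(column_sample)
    encoded = compress("http://example.com/z", phrases)
    print(len(encoded), "tokens for a 20-character value")
    print(decompress(encoded, phrases))
```

The linear scan over phrases in `compress` is chosen for brevity only; a practical encoder would index the phrases (for example with a trie), which is where the cache behavior discussed in the abstract becomes relevant.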

Xiaojian Wang

[1] Hugh E. Williams, et al. Compressing Integers for Fast File Access, 1999, Comput. J.

[2] Hugh E. Williams, et al. A general-purpose compression scheme for large collections, 2002, TOIS.

[3] Jonathan Goldstein, et al. Compressing relations and indexes, 1998, Proceedings 14th International Conference on Data Engineering.

[4] Toon Calders, et al. Mining Compressing Sequential Patterns, 2012, Stat. Anal. Data Min.

[5] Ulf Leser, et al. Trends in Genome Compression, 2014.

[6] Justin Zobel, et al. Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval, 2010, SPIRE.

[7] Szymon Grabowski, et al. Robust relative compression of genomes with random access, 2011, Bioinform.

[8] Ian H. Witten, et al. Modeling for text compression, 1989, CSUR.

[9] Zhiwei Xu, et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems, 2011, 2011 IEEE 27th International Conference on Data Engineering.

[10] Norman May, et al. The SAP HANA Database -- An Architecture Overview, 2012, IEEE Data Eng. Bull.

[11] A. Moffat, et al. Offline dictionary-based compression, 2000, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[12] James A. Storer, et al. Data compression via textual substitution, 1982, JACM.

[13] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[14] Ian H. Witten, et al. Identifying Hierarchical Structure in Sequences: A linear-time algorithm, 1997, J. Artif. Intell. Res.

[15] Fernando Pereira, et al. Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia, 2012.

[16] Peter Deutsch, et al. GZIP file format specification version 4.3, 1996, RFC.

[17] Daniel J. Abadi, et al. Integrating compression and execution in column-oriented database systems, 2006, SIGMOD Conference.

[18] Per-Åke Larson, et al. Columnar Storage in SQL Server 2012, 2012, IEEE Data Eng. Bull.

[19] Carsten Binnig, et al. Dictionary-based order-preserving string compression for main memory column stores, 2009, SIGMOD Conference.

[20] Michael Stonebraker, et al. C-Store: A Column-oriented DBMS, 2005, VLDB.

[21] Marcin Zukowski, et al. MonetDB/X100: Hyper-Pipelining Query Execution, 2005, CIDR.

[22] Glen G. Langdon, et al. Universal modeling and coding, 1981, IEEE Trans. Inf. Theory.

[23] Leonid Boytsov, et al. Decoding billions of integers per second through vectorization, 2012, Softw. Pract. Exp.

[24] Gennady Antoshenkov, et al. Dictionary-based order-preserving string compression, 1997, The VLDB Journal.

[25] Terry A. Welch, et al. A Technique for High-Performance Data Compression, 1984, Computer.

[26] Qiming Chen, et al. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, 2001, Proceedings 17th International Conference on Data Engineering.

[27] Marcin Zukowski, et al. Super-Scalar RAM-CPU Cache Compression, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[28] David A. Huffman, et al. A method for the construction of minimum-redundancy codes, 1952, Proceedings of the IRE.

[29] David J. DeWitt, et al. Materialization Strategies in a Column-Oriented DBMS, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30] Sara P. Garcia, et al. GReEn: a tool for efficient compression of genome resequencing data, 2011, Nucleic acids research.

[31] Viktor Leis, et al. The adaptive radix tree: ARTful indexing for main-memory databases, 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[32] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[33] Donald R. Morrison, et al. PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968, J. ACM.

[34] Ramakrishnan Srikant, et al. Mining sequential patterns, 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[35] Peter Deutsch, et al. DEFLATE Compressed Data Format Specification version 1.3, 1996, RFC.

[36] Hugh E. Williams, et al. General-purpose compression for efficient retrieval, 2001, J. Assoc. Inf. Sci. Technol.

[37] Jilles Vreeken, et al. The long and the short of it: summarising event sequences with serial episodes, 2012, KDD.

[38] Hugh E. Williams, et al. Burst tries: a fast, efficient data structure for string keys, 2002, TOIS.

[39] David J. DeWitt, et al. How to barter bits for chronons: compression and bandwidth trade offs for database scans, 2007, SIGMOD '07.

[40] Jian Pei, et al. Mining frequent patterns without candidate generation, 2000, SIGMOD '00.

[41] Justin Zobel, et al. Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections, 2011, Proc. VLDB Endow.

[42] Ulf Leser, et al. FRESCO: Referential Compression of Highly Similar Sequences, 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[43] Martin L. Kersten, et al. MonetDB: Two Decades of Research in Column-oriented Database Architectures, 2012, IEEE Data Eng. Bull.

[44] Jiawei Han, et al. BIDE: efficient mining of frequent closed sequences, 2004, Proceedings. 20th International Conference on Data Engineering.

[45] Ranjan Sinha, et al. HAT-Trie: A Cache-Conscious Trie-Based Data Structure For Strings, 2007, ACSC.

[46] Jae-Gil Lee, et al. Business Analytics in (a) Blink, 2012, IEEE Data Eng. Bull.

[47] Jure Leskovec, et al. SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[48] Jianyong Wang, et al. Efficient mining of frequent sequence generators, 2008, WWW.