Dictionary-based order-preserving string compression for main memory column stores

Column-oriented database systems [19, 23] perform better than traditional row-oriented database systems on analytical workloads such as those found in decision support and business intelligence applications. Moreover, recent work [1, 24] has shown that lightweight compression schemes significantly improve the query processing performance of these systems. One such lightweight compression scheme uses a dictionary to replace long (variable-length) values of a certain domain with shorter (fixed-length) integer codes. To further speed up expensive query operations such as sorting and searching, column stores often use order-preserving compression schemes. In contrast to existing work, in this paper we argue that order-preserving dictionary compression pays off not only for attributes with a small, fixed domain size but also for long string attributes with a large domain size that may change over time. Consequently, we introduce new data structures that efficiently support order-preserving dictionary compression for (variable-length) string attributes with a large domain size that is likely to change over time. The main idea is to model a dictionary as a table that specifies a mapping from string values to arbitrary integer codes (and vice versa), and to introduce a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data. Our experiments show that our data structures are as fast as (or in some cases even faster than) other state-of-the-art data structures for dictionaries while being less memory intensive.
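
To make the notion of an order-preserving dictionary concrete, the following is a minimal sketch in Python. It is not the indexing structure proposed in the paper: it assumes a static domain and simply uses each string's rank in a sorted list as its code. The class name `OrderPreservingDictionary` and its methods are hypothetical names chosen for illustration.

```python
from bisect import bisect_left


class OrderPreservingDictionary:
    """Toy dictionary: the code of a string is its rank in sorted order.

    Assumption for illustration only: the domain is static. This is not
    the data structure introduced in the paper.
    """

    def __init__(self, values):
        # Distinct values in sorted order; list index == integer code.
        self._sorted = sorted(set(values))

    def encode(self, value):
        """Return the integer code of a string that is in the dictionary."""
        i = bisect_left(self._sorted, value)
        if i == len(self._sorted) or self._sorted[i] != value:
            raise KeyError(value)
        return i

    def decode(self, code):
        """Return the string that belongs to an integer code."""
        return self._sorted[code]

    def code_range(self, low, high):
        """Translate the string range [low, high) into a code range [lo, hi)."""
        return bisect_left(self._sorted, low), bisect_left(self._sorted, high)


column = ["pear", "apple", "fig", "apple"]
dictionary = OrderPreservingDictionary(column)
encoded = [dictionary.encode(v) for v in column]

# Because codes preserve string order, a range predicate on strings can be
# evaluated on the integer codes without decoding the column.
lo, hi = dictionary.code_range("b", "g")
matches = [row for row, code in enumerate(encoded) if lo <= code < hi]
```

Because the codes in this sketch are dense ranks, inserting a new string would shift the codes of all larger strings and force the column to be re-encoded. That cost is what makes large, changing string domains the hard case the abstract refers to.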

[1] David B. Lomet, et al. Order Preserving Compression, 1996, ICDE.

[2] David B. Lomet. The evolution of effective B-tree: page organization and techniques: a personal account, 2001, SIGMOD Record.

[3] Jonathan Goldstein, et al. Compressing relations and indexes, 1998, Proceedings 14th International Conference on Data Engineering.

[4] Justin Zobel, et al. Cache-efficient string sorting using copying, 2007, ACM J. Exp. Algorithmics.

[5] David J. Lilja, et al. Data prefetch mechanisms, 2000, CSUR.

[6] Donald E. Knuth, et al. The Art of Computer Programming: Volume 3: Sorting and Searching, 1998.

[7] Rudolf Bayer, et al. Prefix B-trees, 1977, TODS.

[8] Goetz Graefe, et al. B-tree indexes and CPU caches, 2001, Proceedings 17th International Conference on Data Engineering.

[9] Michael Stonebraker, et al. C-Store: A Column-oriented DBMS, 2005, VLDB.

[10] W. Paul Cockshott, et al. High-Performance Operations Using a Compressed Database Architecture, 1998, Comput. J.

[11] Ranjan Sinha, et al. HAT-Trie: A Cache-Conscious Trie-Based Data Structure For Strings, 2007, ACSC.

[12] Robert Sedgewick, et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[13] Kenneth A. Ross, et al. Buffering Accesses to Memory-Resident Index Structures, 2003, VLDB.

[14] Kenneth A. Ross, et al. Making B+-trees cache conscious in main memory, 2000, SIGMOD '00.

[15] Daniel J. Abadi, et al. Integrating compression and execution in column-oriented database systems, 2006, SIGMOD Conference.

[16] Ian H. Witten, et al. Managing gigabytes (2nd ed.): compressing and indexing documents and images, 1999.

[17] Daniel J. Abadi, et al. Performance tradeoffs in read-optimized databases, 2006, VLDB.

[18] Rajeev Rastogi, et al. Main-memory index structures with fixed-size partial keys, 2001, SIGMOD '01.

[19] Kenneth A. Ross, et al. Cache Conscious Indexing for Decision-Support in Main Memory, 1999, VLDB.

[20] Hector Garcia-Molina, et al. Main Memory Database Systems: An Overview, 1992, IEEE Trans. Knowl. Data Eng.

[21] Marcin Zukowski, et al. Super-Scalar RAM-CPU Cache Compression, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[22] Sven Helmer, et al. The implementation and performance of compressed databases, 2000, SIGMOD Record.

[23] Johannes Gehrke, et al. Query optimization in compressed database systems, 2001, SIGMOD '01.

[24] Goetz Graefe, et al. Data compression and database performance, 1991, Proceedings of the 1991 Symposium on Applied Computing.

[25] Donald R. Morrison, et al. PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968, J. ACM.

[26] Justin Zobel, et al. Cache-Conscious Collision Resolution in String Hash Tables, 2005, SPIRE.

[27] Donald E. Knuth, et al. The art of computer programming, volume 3: (2nd ed.) sorting and searching, 1998.

[28] G. Antoshenkov, et al. Order preserving string compression, 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[29] Marcin Zukowski, et al. MonetDB/X100 - A DBMS In The CPU Cache, 2005, IEEE Data Eng. Bull.