White-box Compression: Learning and Exploiting Compact Table Representations

We formulate a conceptual model for white-box compression, which represents each logical column in tabular data as an openly defined function over the physical columns that are actually stored. Each block of data is therefore accompanied by a header that describes this functional mapping. Because these compression functions are openly defined, database systems can exploit them during query optimization and execution, enabling, for example, better filter predicate pushdown. In addition, we show that white-box compression identifies a broad variety of new compression opportunities, leading to much better compression factors. These opportunities are found by an automatic process that learns the functions from the data; we provide a recursive, pattern-driven algorithm for this learning. Finally, we demonstrate the effectiveness of white-box compression on a new benchmark that we contribute: the Public BI benchmark, which provides a rich set of real-world datasets. We believe our basic prototype for white-box compression opens the way for future research into transparent compressed data representations on the one hand and database system architectures that can exploit them efficiently on the other. It should be seen as another step in the direction of data management systems that are self-learning and optimize themselves for the data they are deployed on.
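
To make the conceptual model concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single hypothetical mapping in which a logical string column is a constant prefix followed by a zero-padded integer, so only the integers are stored physically. The names `PrefixedIntHeader`, `compress`, and `push_down_equals` are illustrative assumptions, as is the choice of Python.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PrefixedIntHeader:
    """Hypothetical block header: logical value = prefix + zero-padded integer."""
    prefix: str
    width: int

    def decode(self, physical: List[int]) -> List[str]:
        # Apply the openly defined function to rebuild the logical column.
        return [f"{self.prefix}{v:0{self.width}d}" for v in physical]

    def push_down_equals(self, literal: str) -> Optional[int]:
        """Rewrite `logical == literal` into an integer to compare against
        the physical column. Returns None when no row can match."""
        if not literal.startswith(self.prefix):
            return None
        digits = literal[len(self.prefix):]
        if len(digits) != self.width or not (digits.isascii() and digits.isdigit()):
            return None
        return int(digits)


def compress(values: List[str], prefix: str, width: int) -> Tuple[PrefixedIntHeader, List[int]]:
    """Split a logical column into a header and one physical integer column.
    Assumes the mapping is already known; learning it is out of scope here."""
    header = PrefixedIntHeader(prefix, width)
    physical = [int(v[len(prefix):]) for v in values]
    if header.decode(physical) != values:
        raise ValueError("values do not fit the assumed prefix/width mapping")
    return header, physical


logical = ["ORD-000123", "ORD-000042", "ORD-000007"]
header, physical = compress(logical, "ORD-", 6)

# The filter logical == "ORD-000042" is evaluated on integers, without decoding.
target = header.push_down_equals("ORD-000042")
matches = [i for i, v in enumerate(physical) if v == target]
```

Because the header exposes the mapping, an equality filter on the logical strings becomes an integer comparison on the stored column, which is the kind of predicate pushdown the abstract refers to. A black-box compressor would have to decompress the block before the filter could be evaluated.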
