From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms

Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, a large number of algorithms is now available, each tailored to different data characteristics. However, the literature lacks a sufficient comparative evaluation of these algorithms across different data and hardware characteristics. To close this gap, we conducted an exhaustive experimental survey by evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data and hardware properties on performance and compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings, which lead to several new insights and to the conclusion that there is no single best algorithm. Moreover, in this article, we introduce and evaluate a novel cost model for selecting a suitable lightweight integer compression algorithm for a given dataset.
