Upscaledb: Efficient integer-key compression in a key-value store using SIMD instructions

Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Reference. In all cases, we describe algorithms that can operate directly on compressed data. Many of our alternatives exploit the single-instruction-multiple-data (SIMD) instructions supported by modern CPUs. We evaluate our techniques in a database environment provided by Upscaledb, a production-quality key-value database. Our best techniques are SIMD accelerated: they reduce memory usage while also improving single-threaded speeds. In particular, a differentially coded SIMD binary-packing technique (BP128) can offer superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten). For analytic workloads, our fast compression techniques offer compelling benefits. Our software is available as open source.
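
To make the baseline concrete, the following is a minimal scalar sketch of variable-byte compression over differentially coded keys, together with a membership test that scans the compressed bytes without first materializing the keys. It is an illustration only, not the Upscaledb or DB2 implementation: the function names (`encode_keys`, `iter_keys`, `contains`) are hypothetical, the byte layout (seven payload bits per byte, high bit set when more bytes follow) is one common convention assumed here, and no SIMD instructions are used.

```python
from typing import Iterable, Iterator


def encode_keys(sorted_keys: Iterable[int]) -> bytes:
    """Delta-code a sorted sequence of non-negative keys, then
    variable-byte encode each gap (hypothetical helper)."""
    out = bytearray()
    previous = 0
    for key in sorted_keys:
        gap = key - previous
        if gap < 0:
            raise ValueError("keys must be sorted and non-negative")
        previous = key
        # Assumed layout: 7 payload bits per byte, least significant
        # group first; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def iter_keys(block: bytes) -> Iterator[int]:
    """Yield the original keys one at a time by decoding gaps and
    accumulating a running sum."""
    key = 0
    gap = 0
    shift = 0
    for byte in block:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # gap continues in the next byte
        else:
            key += gap          # undo the differential coding
            yield key
            gap = 0
            shift = 0


def contains(block: bytes, target: int) -> bool:
    """Search the compressed block directly, stopping at the first
    key that is greater than or equal to the target."""
    for key in iter_keys(block):
        if key >= target:
            return key == target
    return False


keys = [3, 10, 11, 300, 70000]
block = encode_keys(keys)
```

In this small example the five keys occupy 8 bytes, compared with 20 bytes as uncompressed 32-bit integers, and `contains(block, 300)` answers the query while decoding only as many gaps as needed. The running sum makes the search sequential within a block, which is one reason block-oriented schemes are attractive inside B+-tree nodes.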
