A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew

In-memory data analytic systems that use vertical bit-parallel scan methods generally use encoding techniques. We observe that in such environments, there is an opportunity to turn skew in both the data and predicate distributions (usually a problem for query processing) into a benefit that can be leveraged to encode the column values. This paper proposes a padded encoding scheme to address this opportunity. The proposed scheme creates encodings that map common attribute values to codes that can easily be distinguished from other codes by only examining a few bits in the full code. Consequently, scans on columns stored using the padded encoding scheme can safely prune the computation without examining all the bits in the code, thereby reducing the memory bandwidth and CPU cycles that are consumed when evaluating scan queries. Our padded encoding method results in a fixed-length encoding, as fixed-length encodings are easier to manage. However, the proposed padded encoding may produce longer (fixed-length) codes than those produced by popular order-preserving encoding methods, such as dictionary-based encoding. This additional space overhead has the potential to negate the gains from early pruning of the scan computation. However, as we demonstrate empirically, the additional space overhead is generally small, and the padded encoding scheme provides significant performance improvements.

[1]  Philip S. Yu,et al.  An effective algorithm for parallelizing hash joins in the presence of data skew , 1991, [1991] Proceedings. Seventh International Conference on Data Engineering.

[2]  Patrick E. O'Neil,et al.  Bit-sliced index arithmetic , 2001, SIGMOD '01.

[3]  E. F. Moore,et al.  Variable-length binary encodings , 1959 .

[4]  Ismail Oukid,et al.  Vectorizing Database Column Scans with Complex Predicates , 2013, ADMS@VLDB.

[5]  Adriano M. Garsia,et al.  A New Algorithm for Minimum Cost Binary Trees , 1977, SIAM J. Comput..

[6]  Kenneth A. Ross,et al.  High throughput heavy hitter aggregation for modern SIMD processors , 2013, DaMoN '13.

[7]  Jignesh M. Patel,et al.  WideTable: An Accelerator for Analytical Data Processing , 2014, Proc. VLDB Endow..

[8]  Alfred G. Dale,et al.  A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins , 1991, VLDB.

[9]  Kenneth A. Ross,et al.  Implementing database operations using SIMD instructions , 2002, SIGMOD '02.

[10]  Donald E. Knuth,et al.  The Art of Computer Programming: Volume 3: Sorting and Searching , 1998 .

[11]  Pradeep Dubey,et al.  Fast Updates on Read-Optimized Databases Using Multi-Core CPUs , 2011, Proc. VLDB Endow..

[12]  Christos Faloutsos,et al.  On B-Tree Indices for Skewed Distributions , 1992, VLDB.

[13]  Donald Kossmann,et al.  Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia , 2013, Proc. VLDB Endow..

[14]  Liwen Sun,et al.  Fine-grained partitioning for aggressive data skipping , 2014, SIGMOD Conference.

[15]  Philip S. Yu,et al.  Range-based bitmap indexing for high cardinality attributes with skew , 1998, Proceedings. The Twenty-Second Annual International Computer Software and Applications Conference (Compsac '98) (Cat. No.98CB 36241).

[16]  Wolfgang Lehner,et al.  Fast integer compression using SIMD instructions , 2010, DaMoN '10.

[17]  T. C. Hu,et al.  Optimal Computer Search Trees and Variable-Length Alphabetical Codes , 1971 .

[18]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[19]  Ramakrishna Varadarajan,et al.  The Vertica Analytic Database: C-Store 7 Years Later , 2012, Proc. VLDB Endow..

[20]  Alexander Zeier,et al.  SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units , 2009, Proc. VLDB Endow..

[21]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[22]  Liang Chen,et al.  Handling data skew in parallel joins in shared-nothing systems , 2008, SIGMOD Conference.

[23]  Patrick E. O'Neil,et al.  Improved query performance with variant indexes , 1997, SIGMOD '97.

[24]  Bingsheng He,et al.  Database compression on graphics processors , 2010, Proc. VLDB Endow..

[25]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[26]  Carsten Binnig,et al.  Dictionary-based order-preserving string compression for main memory column stores , 2009, SIGMOD Conference.

[27]  Sam Lightstone,et al.  DB2 with BLU Acceleration: So Much More than Just a Column Store , 2013, Proc. VLDB Endow..

[28]  David J. DeWitt,et al.  Practical Skew Handling in Parallel Joins , 1992, VLDB.

[29]  Carlo Curino,et al.  Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems , 2012, SIGMOD Conference.

[30]  Gustavo Alonso,et al.  Predictable Performance for Unpredictable Workloads , 2009, Proc. VLDB Endow..

[31]  Jignesh M. Patel,et al.  BitWeaving: fast scans for main memory data processing , 2013, SIGMOD '13.

[32]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[33]  Jae-Gil Lee,et al.  Joins on Encoded and Partitioned Data , 2014, Proc. VLDB Endow..

[34]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[35]  Frederick Reiss,et al.  Constant-Time Query Processing , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[36]  Johannes Gehrke,et al.  Query optimization in compressed database systems , 2001, SIGMOD '01.

[37]  Clifford A. Lynch,et al.  Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values , 1988, VLDB.

[38]  Norman May,et al.  The SAP HANA Database -- An Architecture Overview , 2012, IEEE Data Eng. Bull..

[39]  Ryan Johnson,et al.  Row-wise parallel predicate evaluation , 2008, Proc. VLDB Endow..

[40]  David A. Huffman,et al.  A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.

[41]  Hakan Ferhatosmanoglu,et al.  Approximate encoding for direct access and query processing over compressed bitmaps , 2006, VLDB.

[42]  Donald E. Knuth,et al.  The art of computer programming, volume 3: (2nd ed.) sorting and searching , 1998 .

[43]  John Allen,et al.  Scuba: Diving into Data at Facebook , 2013, Proc. VLDB Endow..

[44]  Wei Li,et al.  Skew handling techniques in sort-merge join , 2002, SIGMOD '02.