Skipping-oriented Partitioning for Columnar Layouts

As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems, however, increasingly adopt columnar storage schemes, and in such systems a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over state-of-the-art techniques.
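To make the notion of data skipping concrete, the following is a minimal sketch of block-level skipping with per-block min/max metadata. It is not GSOP and not the partitioning scheme of [39]. It only illustrates the generic mechanism those techniques build on. The names `Block`, `scan_with_skipping`, and the column name `x` are hypothetical and chosen for illustration.

```python
class Block:
    """A horizontal partition of rows with per-column min/max metadata."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one dict per tuple
        columns = rows[0].keys()
        # Metadata is computed once, when the block is written.
        self.min = {c: min(r[c] for r in rows) for c in columns}
        self.max = {c: max(r[c] for r in rows) for c in columns}


def scan_with_skipping(blocks, column, lo, hi):
    """Return rows with lo <= row[column] <= hi and the number of blocks read.

    A block is skipped when its [min, max] range for the column cannot
    overlap the query range, so none of its rows are touched.
    """
    matches = []
    blocks_read = 0
    for block in blocks:
        if block.max[column] < lo or block.min[column] > hi:
            continue  # skipped: metadata proves no row can qualify
        blocks_read += 1
        matches.extend(r for r in block.rows if lo <= r[column] <= hi)
    return matches, blocks_read


# Illustrative data: 100 rows clustered on x, split into 10 blocks of 10 rows.
rows = [{"x": i, "y": i % 7} for i in range(100)]
blocks = [Block(rows[i:i + 10]) for i in range(0, 100, 10)]
result, blocks_read = scan_with_skipping(blocks, "x", 25, 34)
```

In this toy layout the query range overlaps only two of the ten blocks, so the other eight are never read. How rows are assigned to blocks determines how often such pruning applies, which is the design question that skipping-oriented partitioning addresses.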

[1]  S. M. Keni Digital Sky Surveys, 1988.

[2]  Ramakrishnan Srikant, et al.  Fast Algorithms for Mining Association Rules in Large Databases, 1994, VLDB.

[3]  Guido Moerkotte, et al.  Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing, 1998, VLDB.

[4]  Surajit Chaudhuri, et al.  Automated Selection of Materialized Views and Indexes in SQL Databases, 2000, VLDB.

[5]  Chun Zhang, et al.  Automating physical database design in a parallel database, 2002, SIGMOD '02.

[6]  David J. DeWitt, et al.  Data page layouts for relational databases on deep memory hierarchies, 2002, The VLDB Journal.

[7]  Matthew Huras, et al.  Efficient Query Processing for Multi-Dimensionally Clustered Tables in DB2, 2003, VLDB.

[8]  Jignesh M. Patel, et al.  Data Morphing: An Adaptive, Cache-Conscious Storage Technique, 2003, VLDB.

[9]  Kenneth A. Ross, et al.  A multi-resolution block storage model for database design, 2003, Seventh International Database Engineering and Applications Symposium, 2003. Proceedings.

[10]  Anastasia Ailamaki, et al.  AutoPart: automating schema design for large scientific databases using data partitioning, 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.

[11]  Vivek R. Narasayya, et al.  Integrating vertical and horizontal partitioning into automated physical database design, 2004, SIGMOD '04.

[12]  Marcin Zukowski, et al.  MonetDB/X100: Hyper-Pipelining Query Execution, 2005, CIDR.

[13]  Michael Stonebraker, et al.  C-Store: A Column-oriented DBMS, 2005, VLDB.

[14]  Daniel J. Abadi, et al.  Integrating compression and execution in column-oriented database systems, 2006, SIGMOD Conference.

[15]  David J. DeWitt, et al.  Materialization Strategies in a Column-Oriented DBMS, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[16]  Martin L. Kersten, et al.  Database Cracking, 2007, CIDR.

[17]  Piotr Synak, et al.  Brighthouse: an analytic data warehouse for ad-hoc queries, 2008, Proc. VLDB Endow.

[18]  Marcin Zukowski, et al.  DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing, 2008, DaMoN '08.

[19]  Martin L. Kersten, et al.  Self-organizing tuple reconstruction in column-stores, 2009, SIGMOD Conference.

[20]  Alexander Zeier, et al.  HYRISE - A Main Memory Hybrid Storage Engine, 2010, Proc. VLDB Endow.

[21]  Andrey Gubarev, et al.  Dremel: Interactive Analysis of Web-Scale Datasets, 2011.

[22]  Carlo Curino, et al.  Schism, 2010, Proc. VLDB Endow.

[23]  Jorge-Arnulfo Quiané-Ruiz, et al.  Trojan data layouts: right shoes for a running elephant, 2011, SoCC.

[24]  Zhiwei Xu, et al.  RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems, 2011, 2011 IEEE 27th International Conference on Data Engineering.

[25]  Jignesh M. Patel, et al.  Column-Oriented Storage Techniques for MapReduce, 2011, Proc. VLDB Endow.

[26]  Ramakrishna Varadarajan, et al.  The Vertica Analytic Database: C-Store 7 Years Later, 2012, Proc. VLDB Endow.

[27]  Michael J. Franklin, et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, 2012, NSDI.

[28]  Wei Lin, et al.  Advanced partitioning techniques for massively distributed computation, 2012, SIGMOD Conference.

[29]  Alexander Hall, et al.  Processing a Trillion Cells per Mouse Click, 2012, Proc. VLDB Endow.

[30]  Sam Lightstone, et al.  DB2 with BLU Acceleration: So Much More than Just a Column Store, 2013, Proc. VLDB Endow.

[31]  Alekh Jindal, et al.  The Uncracked Pieces in Database Cracking, 2013, Proc. VLDB Endow.

[32]  Alekh Jindal, et al.  A Comparison of Knives for Bread Slicing, 2013, Proc. VLDB Endow.

[33]  Siyuan Ma, et al.  Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters, 2013, Proc. VLDB Endow.

[34]  Doug McMahon, et al.  JSON data management: supporting schema-less development in RDBMS, 2014, SIGMOD Conference.

[35]  Liwen Sun, et al.  Fine-grained partitioning for aggressive data skipping, 2014, SIGMOD Conference.

[36]  Anastasia Ailamaki, et al.  H2O: a hands-free adaptive store, 2014, SIGMOD Conference.

[37]  Yuan Yuan, et al.  Major technical advancements in Apache Hive, 2014, SIGMOD Conference.

[38]  Joseph K. Bradley, et al.  Spark SQL: Relational Data Processing in Spark, 2015, SIGMOD Conference.

[39]  Anurag Gupta, et al.  Amazon Redshift and the Case for Simpler Data Warehouses, 2015, SIGMOD Conference.

[40]  Ashish Motivala, et al.  The Snowflake Elastic Data Warehouse, 2016, SIGMOD Conference.