Skipping-oriented Data Design for Large-Scale Analytics

Author(s): Sun, Liwen

Advisor(s): Franklin, Michael J

Abstract: As data volumes continue to expand, analytics approaches that require exhaustively scanning data sets become untenable. For this reason, modern analytics systems employ data skipping techniques to avoid reading large volumes of irrelevant data. By maintaining some metadata for each block of data, a system lets a query skip a block whenever the metadata indicates that the block contains no relevant data. The effectiveness of data skipping, however, depends on how the underlying data are organized into blocks. In this dissertation, we propose a fine-grained data layout framework, called "Generalized Skipping-Oriented Partitioning and Replication" (GSOP-R), which aims to maximize query performance through aggressive data skipping. Based on observations of real-world analytics workloads, we find that workload patterns can be summarized as a succinct set of features. The GSOP-R framework uses these features to transform the incoming data into a small set of feature vectors, and then runs clustering algorithms on the feature vectors instead of the actual data. The resulting GSOP-R layout scheme is highly flexible. For instance, it allows different columns to be horizontally partitioned in different ways and supports replication of only parts of rows or columns. We developed several designs for GSOP-R on Apache Spark and Apache Parquet and then evaluated their performance using two public benchmarks and several real-world workloads. Our results show that GSOP-R can reduce the amount of data scanned and improve end-to-end query response times over state-of-the-art techniques by a factor of 2 to 9.
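The block-level skipping idea described in the abstract can be illustrated with a minimal sketch. It assumes the per-block metadata is simply the minimum and maximum of a single integer column, and it is not the GSOP-R implementation; the names `Block`, `make_blocks`, and `scan_range` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A block of values plus its min/max metadata (assumed metadata form)."""
    rows: List[int]
    lo: int
    hi: int


def make_blocks(values: List[int], block_size: int) -> List[Block]:
    """Cut values into fixed-size blocks and record min/max for each block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for b in blocks:
        # The metadata proves the block holds no value in [lo, hi]: skip it.
        if b.hi < lo or b.lo > hi:
            continue
        scanned += 1
        matches.extend(v for v in b.rows if lo <= v <= hi)
    return matches, scanned


values = [5, 91, 12, 78, 3, 85, 20, 99]

# Same data, two layouts: arrival order versus sorted order.
unsorted_blocks = make_blocks(values, block_size=2)
sorted_blocks = make_blocks(sorted(values), block_size=2)

print(scan_range(unsorted_blocks, 10, 25))  # every block overlaps the range
print(scan_range(sorted_blocks, 10, 25))    # only one block must be scanned
```

Both layouts return the same matching values, but the arrival-order layout scans all four blocks while the sorted layout scans one. This is the sense in which the effectiveness of skipping depends on how the data are organized into blocks.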
