Efficient Data Blocking and Skipping Framework Applying Heuristic Rules

Data blocking is an effective data-skipping technique that reduces data access and shortens query response time in query engines. Data is divided into fine-grained, balanced blocks, each with corresponding metadata, so that a query can skip a block whenever the metadata indicates that the block contains no relevant data. The quality of a blocking strategy therefore depends on whether it can produce, in reasonable time, a data layout that is expected to skip most of the data. In this paper, we propose several algorithms that drastically reduce the time complexity of existing workload-analysis-based blocking strategies, at the cost of a relatively small loss in the estimated number of tuples that can be skipped. Through theoretical analysis, we prove that the time complexity of our algorithms is lower than that of the Ward algorithm. We then present the complete blocking and skipping workflow, integrate it into Spark SQL, and evaluate it experimentally. The results show that our technique significantly improves blocking efficiency compared with the Ward algorithm while retaining almost the same skipping ability.
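The metadata-based skipping idea described above can be illustrated with a minimal sketch. It is not the blocking algorithm proposed in the paper: it assumes the simplest possible metadata, a per-block minimum and maximum over a single integer column, and it builds blocks by sorting rather than by workload analysis. The names `Block`, `build_blocks`, and `range_query` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Block:
    """A block of values plus min/max metadata (assumed metadata form)."""
    rows: List[int]
    lo: int
    hi: int


def build_blocks(values: Iterable[int], block_size: int) -> List[Block]:
    """Sort the values and cut them into equal-sized (balanced) blocks.

    Sorting is a stand-in for a real blocking strategy; it is used here
    only because it yields tight min/max ranges per block.
    """
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(rows=chunk, lo=chunk[0], hi=chunk[-1]))
    return blocks


def range_query(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually scanned."""
    hits: List[int] = []
    scanned = 0
    for block in blocks:
        # Metadata check: if the block's range cannot overlap the query
        # range, the block is skipped without reading its rows.
        if block.hi < low or block.lo > high:
            continue
        scanned += 1
        hits.extend(v for v in block.rows if low <= v <= high)
    return hits, scanned


blocks = build_blocks(range(100), block_size=10)
hits, scanned = range_query(blocks, 25, 34)
print(f"scanned {scanned} of {len(blocks)} blocks, found {len(hits)} rows")
```

In this toy setting, the query touches only the blocks whose value range overlaps the predicate. With multi-column predicates and a real workload, the layout matters far more, which is why the choice of blocking strategy determines how much data can be skipped.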
