Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing

Sample-based approximate query processing (AQP) suffers from several pitfalls, such as the inability to answer highly selective queries and unreliable confidence intervals when sample sizes are small. Recent research has proposed combining materialized, pre-computed aggregates with sampling to make AQP more accurate and more reliable. We explore this approach in detail and propose an AQP physical design called PASS (Precomputation-Assisted Stratified Sampling). PASS builds a tree of partial aggregates that cover different partitions of the dataset. The leaf nodes of this tree form the strata for stratified samples. Aggregate queries whose predicates align with the partitions (or unions of partitions) are answered exactly with a depth-first search, and any partial overlaps are approximated with the stratified samples. We propose an algorithm for optimally partitioning the data into such a data structure, along with various practical approximation techniques.
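
To make the structure described above concrete, the following is a minimal Python sketch, not the PASS implementation. It assumes a single numeric predicate attribute, SUM queries over half-open ranges, and equal-width binary partitioning in place of the optimized partitioning the abstract refers to. The names `Node`, `build`, and `approx_sum` are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Row = Tuple[float, float]  # (predicate attribute, aggregated value)

@dataclass
class Node:
    lo: float      # inclusive lower bound of this partition
    hi: float      # exclusive upper bound of this partition
    total: float   # precomputed SUM over every row in the partition
    count: int     # number of rows in the partition
    children: List["Node"] = field(default_factory=list)
    sample: List[Row] = field(default_factory=list)  # filled for leaves only

def build(rows: List[Row], lo: float, hi: float, leaf_width: float,
          sample_size: int, rng: random.Random) -> Node:
    """Build a tree of partial aggregates; each leaf is one stratum.

    Assumption: partitions are split at the midpoint until they are at most
    leaf_width wide. This is a stand-in for an optimized partitioning.
    """
    part = [(k, v) for k, v in rows if lo <= k < hi]
    node = Node(lo, hi, float(sum(v for _, v in part)), len(part))
    if hi - lo <= leaf_width:
        # Leaf: keep a uniform sample without replacement from this stratum.
        node.sample = rng.sample(part, min(sample_size, len(part)))
    else:
        mid = (lo + hi) / 2
        node.children = [
            build(part, lo, mid, leaf_width, sample_size, rng),
            build(part, mid, hi, leaf_width, sample_size, rng),
        ]
    return node

def approx_sum(node: Node, qlo: float, qhi: float) -> float:
    """Estimate SUM(value) for rows with qlo <= key < qhi via depth-first search."""
    if node.count == 0 or qhi <= node.lo or qlo >= node.hi:
        return 0.0                      # no overlap with this partition
    if qlo <= node.lo and node.hi <= qhi:
        return node.total               # fully covered: exact precomputed answer
    if node.children:
        return sum(approx_sum(c, qlo, qhi) for c in node.children)
    # Partially overlapped leaf: scale the matching sampled values up to the
    # stratum size (the standard stratified-sampling estimator for SUM).
    matched = sum(v for k, v in node.sample if qlo <= k < qhi)
    return node.count * matched / len(node.sample)
```

In this sketch, a query such as `approx_sum(root, 16.0, 48.0)` that lines up with partition boundaries touches only precomputed totals, so sampling error enters only through leaves that the predicate cuts through.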
