Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

Aggregation is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate the computation of aggregates by evaluating them over samples, and confidence intervals (CIs) are widely used to quantify the resulting error. The CI techniques used in practice fall into two categories: those that are tight but not correct, i.e., they yield narrow intervals but offer only asymptotic guarantees, which makes them unreliable; and those that are correct but not tight, i.e., they offer rigorous guarantees but are overly conservative, producing intervals too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique that eliminates PHOS and pairing it with known CI techniques that do not suffer from PMA, we obtain CIs with strong guarantees that require fewer samples to reach the same width. We implement our techniques on top of a sampling-optimized in-memory column store and show, on a real dataset, that they accelerate aggregate queries by up to 124x over traditional AQP-with-guarantees and by more than 1000x over exact methods.
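To make the two categories of CI concrete, the following minimal Python sketch contrasts a CLT-based interval (tight, but only asymptotically valid) with a Hoeffding interval (valid for every sample size under i.i.d. sampling, but conservative). It is an illustration under stated assumptions, not the technique developed in the paper: the function names, the synthetic data, and the declared value range [0, 1000] are all hypothetical.

```python
import math
import random
import statistics


def clt_interval(sample, z=1.96):
    """Asymptotic (CLT-based) CI for the mean.

    Tight, but the nominal ~95% coverage only holds in the limit of large n.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half = z * statistics.stdev(sample) / math.sqrt(n)
    return mean - half, mean + half


def hoeffding_interval(sample, lo, hi, delta=0.05):
    """Hoeffding CI for the mean of i.i.d. values known to lie in [lo, hi].

    Covers the true mean with probability >= 1 - delta for every n, but its
    width depends only on the declared range (hi - lo), not on the data seen.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half = (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))
    # Clip to the known range, since the mean cannot lie outside it.
    return max(lo, mean - half), min(hi, mean + half)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Hypothetical column: declared range is [0, 1000], values cluster near 10.
    sample = [rng.uniform(9.0, 11.0) for _ in range(1000)]
    print("CLT:      ", clt_interval(sample))
    print("Hoeffding:", hoeffding_interval(sample, lo=0.0, hi=1000.0))
```

On data like this, the Hoeffding interval is orders of magnitude wider than the CLT interval because its width is driven by the declared range rather than by the values actually observed. This is the kind of looseness that makes rigorous intervals expensive in terms of samples.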
