A Handbook for Building an Approximate Query Engine

There has been much research on various aspects of Approximate Query Processing (AQP), such as sampling strategies, error estimation mechanisms, and various types of data synopses. However, many subtle challenges arise when building an actual AQP engine that can be deployed and used by real-world applications. These subtleties are often ignored (or at least not elaborated on) by the theoretical literature and academic prototypes alike. To the best of our knowledge, this article is the first to focus on the subtle challenges that one must address when designing an AQP system. We intend it to serve as a handbook listing the critical design choices that database practitioners must be aware of when building or using an AQP system, not to prescribe a specific solution to each challenge.
