Approximate Query Processing: What is New and Where to Go?

Abstract

Online analytical processing (OLAP) is a core functionality in database systems. The performance of OLAP is crucial for making online decisions in many applications. However, supporting OLAP on large datasets, especially big data, is costly, and methods that compute exact answers cannot meet the high-performance requirement. To alleviate this problem, approximate query processing (AQP) has been proposed, which aims to efficiently find an approximate answer that is as close as possible to the exact answer. Existing AQP techniques can be broadly divided into two categories. (1) Online aggregation: select samples online and use these samples to answer OLAP queries. (2) Offline synopses generation: generate synopses offline based on a-priori knowledge (e.g., data statistics or query workload) and use these synopses to answer OLAP queries. We discuss the research challenges in AQP and summarize existing techniques that address these challenges. In addition, we review how to use AQP to support other complex data types, e.g., spatial data and trajectory data, and other applications, e.g., data visualization and data cleaning. We also introduce existing AQP systems and summarize their advantages and limitations. Lastly, we outline research challenges and opportunities for AQP. We believe that this survey can help practitioners understand existing AQP techniques and select appropriate methods for their applications.
