Queries with Bounded Errors & Bounded Response Times on Very Large Data

Author(s): Agarwal, Sameer | Advisor(s): Stoica, Ion | Abstract: Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and the limitations of disk and memory bandwidth often make it infeasible to deliver answers at human-interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if those answers arrive quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of the data, rather than on the whole dataset. In this thesis, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses three key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time, (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements, and (3) an error estimation and diagnostics module that produces approximate answers and reliable error bars. We evaluate BlinkDB extensively against well-known database benchmarks and a number of real-world analytic workloads, showing that it is possible to implement an end-to-end query approximation pipeline that produces approximate answers with reliable error bars at interactive speeds.
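The trade-off between sample size and accuracy described in the abstract can be illustrated with a minimal Python sketch. It is not BlinkDB's implementation: it swaps in a plain uniform sample and a normal-approximation (central limit theorem) confidence interval in place of the stratified samples and the error estimation module the thesis describes. The names `approx_mean` and `build_population`, the synthetic exponential data, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width). estimate +/- half_width is an
    approximate 95% confidence interval under the normal approximation.
    The finite-population correction is ignored for simplicity.
    """
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def build_population(n, rng):
    """Hypothetical skewed data, e.g. session times with a mean near 120."""
    return [rng.expovariate(1 / 120.0) for _ in range(n)]


rng = random.Random(0)  # fixed seed so runs are repeatable
population = build_population(200_000, rng)

exact = statistics.fmean(population)          # full scan
small = approx_mean(population, 500, rng)     # faster, wider error bar
large = approx_mean(population, 20_000, rng)  # slower, tighter error bar

print(f"exact mean       : {exact:.2f}")
print(f"sample of 500    : {small[0]:.2f} +/- {small[1]:.2f}")
print(f"sample of 20,000 : {large[0]:.2f} +/- {large[1]:.2f}")
```

The error bar shrinks roughly with the square root of the sample size, so a 40x larger sample narrows the interval by about 6x. A system that meets an accuracy or response time target has to pick a point on this curve for each query, which is the role the abstract assigns to dynamic sample selection.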
