A Sampling-Based Hybrid Approximate Query Processing System in the Cloud

Sampling-based approximate query processing lets users of 'Big Data' analytical applications save time and resources: if an estimated result meets the accuracy requirement early, there is no need to wait for the exact final result. Online aggregation (OLA) is an attractive technique of this kind; it answers aggregation queries with approximate results whose confidence intervals tighten over time. OLA has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and save money by terminating the computation once sufficient accuracy has been reached. Unfortunately, a major obstacle remains: OLA estimation can fail when the sample set is biased, violating the unbiased-sampling assumption on which OLA relies, and such failures degrade OLA performance. To address this problem, we first propose a hybrid approximate query processing model that improves overall OLA performance. Its dynamic scheme-switching mechanism moves unpromising OLA queries to a bootstrap scheme for further processing, avoiding the full dataset scan that an OLA estimation failure would otherwise cause. We also present a progressive estimation method that reduces the false positive ratio of the scheme-switching mechanism. Finally, we have implemented the hybrid approximate query processing system in Hadoop and conducted extensive experiments on the TPC-H benchmark with skewed data distributions. The results show that the hybrid system produces acceptable approximate results in a time one order of magnitude shorter than the original OLA over Hadoop.
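
The two estimation schemes named above can be illustrated with a small single-machine sketch. It is not the paper's Hadoop implementation and it does not model the scheme-switching mechanism or the progressive estimation method; it only assumes a textbook central-limit-theorem interval for an AVG query on the OLA side and a percentile bootstrap interval on the bootstrap side. All function names and parameters (`ola_mean_estimate`, `bootstrap_mean_interval`, `online_aggregate`, `rel_error`, and so on) are hypothetical and chosen for illustration.

```python
import math
import random
import statistics


def ola_mean_estimate(sample, z=1.96):
    """Return (estimate, half_width) of a CLT-based interval for the mean.

    Assumes the sample is an unbiased random sample with at least 2 values.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, half_width


def bootstrap_mean_interval(sample, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def online_aggregate(data, batch_size=100, rel_error=0.05, seed=0):
    """Read random batches and stop once the interval is tight enough.

    Stops when half_width / |estimate| <= rel_error, or when the data
    runs out. Assumes non-empty data and batch_size >= 2.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)  # random order stands in for unbiased sampling
    seen = []
    for start in range(0, len(shuffled), batch_size):
        seen.extend(shuffled[start:start + batch_size])
        est, hw = ola_mean_estimate(seen)
        if est != 0 and hw / abs(est) <= rel_error:
            break  # accuracy target met: no need to scan the rest
    return est, hw, len(seen)
```

Calling `online_aggregate(data)` returns the estimate, the interval half-width, and the number of records actually read, which is typically far smaller than `len(data)` when the data are not heavily skewed. On skewed or non-randomly ordered input the interval may tighten slowly or mislead, which is the failure case the hybrid model is designed to handle.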
