Very fast estimation for result and accuracy of big data analytics: The EARL system

Approximate results based on samples are often the only way that advanced analytical applications on very massive data sets (a.k.a. 'big data') can satisfy their time and resource constraints. Unfortunately, methods and tools for computing accurate early results are currently not supported in big data systems such as Hadoop. We therefore propose a nonparametric accuracy estimation method and system to speed up big data analytics. Our framework, called EARL (Early Accurate Result Library), works by predicting the learning curve and choosing the sample size needed to achieve the error bound specified by the user. The error estimates are based on bootstrapping, a technique that has been widely used and validated by statisticians and that can be applied to arbitrary functions and data distributions. This demo elucidates (a) the functionality of EARL and its intuitive GUI, whereby first-time users can appreciate the accuracy obtainable from increasing sample sizes simply by viewing the learning curve displayed by EARL, and (b) the usability of EARL, whereby conference participants can interact with the system to quickly estimate the sample sizes needed to obtain desired accuracies or response times, and then compare them against the accuracies and response times obtained in the actual computations.
