Empirical Analysis of Asymptotic Ensemble Learning for Big Data

In many application areas, the data being generated and processed exceeds the petabyte scale. Analyzing such massive, ever-growing volumes of data poses both computational and statistical challenges. To address these challenges, distributed and parallel processing frameworks have been used to implement scalable data analysis algorithms. Nevertheless, processing an entire big data set at once may exceed the available computing resources or the time budget of some applications. Approximate approaches can then be used to obtain asymptotic analysis results, especially when the analysis tolerates an approximate answer rather than an exact one. However, most approximation approaches require drawing a random sample of the data, which is a nontrivial task for big data sets. In this paper, we employ ensemble learning as an approach to asymptotic analysis using randomly selected subsets (i.e., data blocks) of a big data set. We propose an asymptotic ensemble learning framework that relies on block-based sampling rather than record-based sampling. To demonstrate the feasibility and performance of this framework, we present an empirical analysis on real data sets. Beyond the scalability advantage, the experimental results show that several blocks of a data set are enough to obtain approximately the same results as using the whole data set.
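The abstract describes the framework only at a high level. As a rough illustration of the core idea, the following minimal sketch contrasts block-based sampling with record-based sampling: whole data blocks are selected at random, one base learner is trained per block, and predictions are aggregated by majority vote. It is a sketch under stated assumptions, not the authors' implementation: it uses in-memory NumPy arrays as a stand-in for distributed storage blocks, scikit-learn decision trees as base learners, and integer class labels; the function names block_ensemble and predict_majority are hypothetical.

```python
# Minimal sketch of block-based ensemble learning (hypothetical API; assumes
# integer class labels and that the data fits in memory for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def block_ensemble(X, y, n_blocks=100, n_selected=10, seed=0):
    rng = np.random.default_rng(seed)
    # Partition the data set into contiguous blocks, mimicking how a
    # distributed file system stores a large file as fixed-size blocks.
    blocks = np.array_split(np.arange(len(X)), n_blocks)
    # Block-based sampling: randomly pick whole blocks, not individual records.
    chosen = rng.choice(n_blocks, size=n_selected, replace=False)
    # Train one base learner per selected block.
    return [DecisionTreeClassifier().fit(X[blocks[b]], y[blocks[b]])
            for b in chosen]

def predict_majority(models, X):
    # Aggregate the base learners' outputs by majority vote per record.
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Under these assumptions, one would train models = block_ensemble(X_train, y_train, n_selected=k) for increasing k and compare predict_majority(models, X_test) against a model fitted on the full data set; the paper's claim is that accuracy plateaus after only several blocks.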
