Random Sample Partition: A Distributed Data Model for Big Data Analysis

With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different industrial scenarios. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
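A minimal sketch of the core idea, assuming an in-memory data set for illustration (the function names `make_rsp_blocks` and `estimate_mean` are hypothetical and not part of the paper's implementation): records are first globally shuffled, then cut into disjoint blocks, so each block approximates a random sample of the whole data set and a statistic can be estimated from only a few blocks.

```python
import numpy as np

def make_rsp_blocks(data, n_blocks, seed=0):
    """Randomize record order, then cut into disjoint blocks.

    After the global shuffle, each block is approximately a random
    sample of the full data set, so its distribution resembles that of
    the whole -- the defining property of an RSP block.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data)            # global random shuffle
    return np.array_split(shuffled, n_blocks)   # disjoint RSP blocks

def estimate_mean(blocks, n_sampled, seed=0):
    """Approximate the data-set mean from a few randomly chosen RSP
    blocks, without touching the remaining blocks."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(blocks), size=n_sampled, replace=False)
    return float(np.mean([blocks[i].mean() for i in chosen]))

# Example: estimate the mean of one million values from 5 of 100 blocks.
data = np.random.default_rng(1).normal(loc=3.0, size=1_000_000)
blocks = make_rsp_blocks(data, n_blocks=100)
print(estimate_mean(blocks, n_sampled=5))   # close to 3.0
```

In a distributed setting the same principle applies, except that the shuffle and block creation are carried out across cluster nodes rather than in local memory.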
