A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis

To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, this paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation. The RSP representation ensures that each individual data block is a random sample of the big data set and can therefore be used to estimate the statistical properties of the big data set. The first stage of the algorithm sequentially chunks the big data set into non-overlapping subsets and distributes these subsets as data block files to the nodes of a cluster. The second stage takes a random sample without replacement from each subset to form a new subset, which is saved as an RSP data block file. This random sampling step is repeated until all data records in all subsets are used up, and the resulting set of RSP data block files forms an RSP of the big data set. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models that perform better than the single model built from the entire data set.
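The two stages can be illustrated with a minimal in-memory sketch. It is not the Spark and HDFS implementation described in the paper. It assumes that the data set fits in a Python list, that the samples drawn from the individual subsets are merged to form each RSP block, and that every RSP block receives an equal share of each subset. The function name `tsdp_partition` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def tsdp_partition(records, num_chunks, num_rsp_blocks, seed=0):
    """Toy two-stage partitioning (hypothetical helper, in-memory only)."""
    rng = random.Random(seed)
    n = len(records)

    # Stage 1: sequentially chunk the data into non-overlapping subsets.
    chunk_size = -(-n // num_chunks)  # ceiling division
    chunks = [records[i:i + chunk_size] for i in range(0, n, chunk_size)]

    # Stage 2: sample each subset without replacement. Shuffling a subset
    # and taking strided slices yields disjoint random samples that
    # together use up every record of the subset.
    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for chunk in chunks:
        shuffled = list(chunk)
        rng.shuffle(shuffled)
        for j in range(num_rsp_blocks):
            # Assumption: the j-th sample of every subset is merged
            # into the j-th RSP block.
            rsp_blocks[j].extend(shuffled[j::num_rsp_blocks])
    return rsp_blocks


# Usage: 1000 ordered records, 4 sequential chunks, 5 RSP blocks.
records = list(range(1000))
blocks = tsdp_partition(records, num_chunks=4, num_rsp_blocks=5)

# Every record lands in exactly one RSP block.
all_records = sorted(r for block in blocks for r in block)
print(all_records == records)

# Each block mean is an estimate of the mean of the whole data set.
for block in blocks:
    print(len(block), sum(block) / len(block))
```

Because the input in this example is ordered, a stage-one chunk on its own covers only a narrow value range and would give a biased estimate of the mean, whereas each RSP block draws records from every chunk.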
