Paper Information - Efficient Distributed Data Clustering on Spark

Efficient Distributed Data Clustering on Spark

Data clustering is usually time-consuming because, by default, it must iteratively aggregate and process large volumes of data. Approximate aggregation based on samples provides fast results with guaranteed quality. In this paper, we propose applying approximation techniques to data clustering to trade off clustering efficiency against result quality, together with online accuracy estimation. The proposed method is based on bootstrap trials. We implemented this method as an Intelligent Bootstrap Library (IBL) on Spark to support efficient data clustering. Extensive evaluations show that IBL provides a 2x speed-up over the state-of-the-art solution under the same error bound.
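To make the idea of bootstrap-based accuracy estimation concrete, here is a minimal sketch in plain Python (not Spark, and not the IBL API, which the abstract does not describe). It assumes the aggregate of interest is a one-dimensional mean, such as one coordinate of a cluster centroid, and that the sample is doubled until a bootstrap confidence half-width falls within a user-supplied error bound. The function names `bootstrap_error` and `approximate_mean`, the doubling schedule, and the default of 200 trials are all illustrative assumptions.

```python
import random
import statistics


def bootstrap_error(sample, trials=200, confidence=0.95, rng=None):
    """Return the sample mean and a bootstrap half-width for its error.

    Hypothetical helper: percentile bootstrap over `trials` resamples.
    """
    rng = rng or random.Random(0)
    n = len(sample)
    estimate = statistics.fmean(sample)
    # Each trial resamples n points with replacement and recomputes the mean.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(trials)
    )
    lo_idx = int((1 - confidence) / 2 * trials)
    hi_idx = min(trials - 1, int((1 + confidence) / 2 * trials))
    half_width = (means[hi_idx] - means[lo_idx]) / 2
    return estimate, half_width


def approximate_mean(data, error_bound, start=100, rng=None):
    """Grow a random sample until the bootstrap half-width meets the bound.

    Assumption: the sample size doubles each round; the paper may use a
    different schedule.
    """
    rng = rng or random.Random(0)
    size = min(start, len(data))
    while True:
        sample = rng.sample(data, size)
        estimate, half_width = bootstrap_error(sample, rng=rng)
        # Stop once the estimate is accurate enough or all data is used.
        if half_width <= error_bound or size == len(data):
            return estimate, half_width, size
        size = min(2 * size, len(data))


if __name__ == "__main__":
    gen = random.Random(1)
    data = [gen.gauss(10.0, 2.0) for _ in range(10_000)]
    est, hw, used = approximate_mean(data, error_bound=0.2)
    print(f"estimate={est:.3f} half_width={hw:.3f} sample_size={used}")
```

In a clustering setting, the same loop would be applied to each centroid update, so that an iteration stops reading data as soon as the estimated error is within the bound rather than scanning the full dataset.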
