Big Data Clustering Based on Summary Statistics

Big Data is expanding rapidly and is widely researched and used in many domains. One of the largest challenges in data mining is how to cluster Big Data efficiently. The CF-tree is the origin of many big data clustering algorithms; however, it has some shortcomings. This paper proposes an algorithm named Clustering based On the Summary Statistics (COSS). We first analyze in detail the shortcomings of the traditional approaches, which construct the CF-tree with a constant radius threshold T for the micro clusters, and propose a dynamic adaptive threshold setting mechanism. Once all the micro clusters have been obtained, a suitable clustering algorithm is applied to them to produce the final clustering result. We include a performance study demonstrating that the improved CF-tree is more space efficient and that the clustering results are more refined.
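To make the role of the radius threshold T concrete, the following is a minimal sketch of micro-cluster construction from summary statistics in the style of BIRCH [3], using the constant threshold that the abstract identifies as the traditional approach. It is not the COSS algorithm: the adaptive threshold mechanism is not specified in the abstract and is not reproduced here. The names `ClusteringFeature` and `build_micro_clusters` are hypothetical, and a flat list of micro clusters stands in for the CF-tree to keep the example short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusteringFeature:
    """Summary statistics (N, LS, SS) of one micro cluster."""
    n: int = 0                                # number of points absorbed
    ls: list = field(default_factory=list)    # per-dimension linear sum
    ss: float = 0.0                           # sum of squared norms

    def added(self, point):
        """Return a new CF with `point` absorbed (CFs are additive)."""
        ls = self.ls if self.ls else [0.0] * len(point)
        return ClusteringFeature(
            n=self.n + 1,
            ls=[a + b for a, b in zip(ls, point)],
            ss=self.ss + sum(x * x for x in point),
        )

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, clamped at 0 against rounding error
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def build_micro_clusters(points, threshold):
    """Single pass: absorb each point into the nearest micro cluster
    if the resulting radius stays within the constant threshold T,
    otherwise start a new micro cluster.

    Assumption: a flat list replaces the CF-tree; a real CF-tree would
    locate the nearest leaf entry by descending the tree.
    """
    clusters = []
    for p in points:
        best = None
        if clusters:
            best = min(
                range(len(clusters)),
                key=lambda i: math.dist(clusters[i].centroid(), p),
            )
        if best is not None:
            candidate = clusters[best].added(p)
            if candidate.radius() <= threshold:
                clusters[best] = candidate
                continue
        clusters.append(ClusteringFeature().added(p))
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    for cf in build_micro_clusters(data, threshold=1.5):
        print(cf.n, cf.centroid(), round(cf.radius(), 3))
```

Because only N, LS and SS are stored, each micro cluster takes constant space regardless of how many points it has absorbed, and a single fixed T decides how coarse every micro cluster is. That fixed choice is the limitation the abstract attributes to the traditional construction.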
