Uncertain Data Clustering in Distributed Peer-to-Peer Networks

Uncertain data clustering is recognized as an essential task in data mining research. Many centralized clustering algorithms have been extended to handle uncertain data by defining new distance or similarity measures. With the rapid development of network applications, however, these centralized methods show their limitations when clustering data in a large, dynamic, distributed peer-to-peer network, owing to privacy and security concerns or to the technical constraints of distributed environments. In this paper, we propose a novel distributed uncertain data clustering algorithm in which the centralized global clustering solution is approximated by performing distributed clustering. To shorten the execution time, a reduction technique is then applied to transform the proposed method into its deterministic form by replacing each uncertain data object with its expected centroid. Finally, an attribute-weight-entropy regularization technique is used to enhance the proposed distributed clustering method, yielding better clustering results and extracting the essential features for cluster identification. Experiments on both synthetic and real-world data demonstrate the efficiency and superiority of the presented algorithm.
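
The reduction step can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each uncertain object is represented by a finite set of equally weighted samples and that dissimilarity is the expected squared Euclidean distance. Under those assumptions, the expected squared distance from an object to a cluster center equals the squared distance from the object's expected centroid to that center, plus a variance term that does not depend on the center, so both formulations give the same assignment. All function and variable names below are hypothetical.

```python
import numpy as np

def expected_centroids(samples):
    """Collapse each uncertain object to the mean of its samples.

    samples: array of shape (n_objects, n_samples, n_dims)
    returns: array of shape (n_objects, n_dims)
    """
    return samples.mean(axis=1)

def assign_by_expected_distance(samples, centers):
    """Assign each object to the center with the smallest expected
    squared distance, averaging over all samples (the costly form)."""
    diff = samples[:, :, None, :] - centers[None, None, :, :]
    expected_sq = (diff ** 2).sum(axis=-1).mean(axis=1)
    return expected_sq.argmin(axis=1)

def assign_by_centroid(samples, centers):
    """Assign each object using only its expected centroid
    (the reduced, deterministic form)."""
    mu = expected_centroids(samples)
    diff = mu[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)

# Synthetic data (an assumption for illustration): 50 uncertain
# objects in 2-D, each described by 20 noisy samples.
rng = np.random.default_rng(0)
means = rng.normal(size=(50, 2)) * 3.0
samples = means[:, None, :] + rng.normal(scale=0.5, size=(50, 20, 2))
centers = rng.normal(size=(3, 2)) * 3.0

labels_full = assign_by_expected_distance(samples, centers)
labels_reduced = assign_by_centroid(samples, centers)
```

In this sketch the reduced form touches one point per object instead of every sample, which is where the saving in execution time would come from; the distributed and entropy-weighted parts of the method are not modeled here.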
