Efficient Clustering Aggregation Based on Data Fragments

Clustering aggregation, also known as clustering ensembles, has emerged as a powerful technique for combining different clustering results into a single, better clustering. Existing clustering aggregation algorithms are applied directly to data points, in what is referred to as the point-based approach, and they become inefficient when the number of data points is large. We define an efficient approach for clustering aggregation based on data fragments. In this fragment-based approach, a data fragment is any subset of the data that is not split by any of the clustering results. To establish the theoretical basis of the proposed approach, we prove that clustering aggregation can be performed directly on data fragments under two widely used goodness measures for clustering aggregation taken from the literature. Three new clustering aggregation algorithms are described. Experimental results on several public data sets show that the new algorithms have lower computational complexity than three well-known point-based clustering aggregation algorithms (Agglomerative, Furthest, and LocalSearch) without sacrificing accuracy.
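The definition of a data fragment can be illustrated with a minimal sketch. It assumes each input clustering is given as a list of cluster labels, one per data point, and it returns the largest subsets that no clustering splits (points with identical labels in every clustering). The function name `data_fragments` and the input format are hypothetical, chosen for illustration; this is not the authors' implementation.

```python
from collections import defaultdict


def data_fragments(clusterings):
    """Group point indices that receive the same label in every clustering.

    `clusterings` is a list of label lists, one list per input clustering,
    all of the same length (an assumed input format).
    """
    if not clusterings:
        return []
    n_points = len(clusterings[0])
    if any(len(labels) != n_points for labels in clusterings):
        raise ValueError("all clusterings must label the same number of points")

    groups = defaultdict(list)
    for i in range(n_points):
        # The tuple of labels across all clusterings identifies the fragment.
        signature = tuple(labels[i] for labels in clusterings)
        groups[signature].append(i)
    return list(groups.values())


# Three clusterings of six points (labels are arbitrary integers).
clusterings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 0, 1, 1],
]
fragments = data_fragments(clusterings)
print(fragments)  # [[0, 1], [2], [3], [4, 5]]
```

In this toy input, six points collapse into four fragments, so an aggregation algorithm working on fragments would handle four items instead of six. The reduction depends on how much the input clusterings agree.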
