An integrated K-means - Laplacian cluster ensemble approach for document datasets

Cluster ensembles have become an important extension of traditional clustering algorithms, yet the cluster ensemble problem remains challenging because of the inherent difficulty of resolving label correspondence across the base clusterings. We adapted the integrated K-means - Laplacian clustering approach to the cluster ensemble problem by exploiting both the attribute information embedded in the cluster labels and the pairwise relations among the objects. The optimal solution of the proposed approach requires computing the pseudo-inverse of the normalized Laplacian matrix and the eigenvalue decomposition of a large matrix, both of which can be computationally burdensome for large-scale document datasets. We devised an algebraic transformation that carries out these computations efficiently and proposed an integrated K-means - Laplacian cluster ensemble approach (IKLCEA). Experimental results on benchmark document datasets demonstrate that IKLCEA outperforms other cluster ensemble techniques in most cases. In addition, IKLCEA is computationally efficient and can be readily employed in large-scale document applications.
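To make the computational concern concrete, the following is a minimal sketch of one way to avoid forming an n x n matrix when the pairwise relations come from cluster labels. It is an illustration under stated assumptions, not the transformation used in IKLCEA: it assumes the pairwise similarity is the co-association matrix S = H H^T / r, where H is the stacked one-hot encoding of r base clusterings, and it recovers the leading eigenpairs of the normalized Laplacian from a thin SVD of an n x m factor (m being the total number of base clusters). The function names `indicator_matrix` and `spectral_embedding_low_rank` are hypothetical.

```python
import numpy as np


def indicator_matrix(labelings):
    """Stack one-hot encodings of each base clustering into an n x m matrix."""
    blocks = []
    for labels in labelings:
        labels = np.asarray(labels).ravel()
        # Map arbitrary label values to 0..c-1 for this base clustering.
        _, inverse = np.unique(labels, return_inverse=True)
        inverse = np.ravel(inverse)
        block = np.zeros((labels.size, inverse.max() + 1))
        block[np.arange(labels.size), inverse] = 1.0
        blocks.append(block)
    return np.hstack(blocks)


def spectral_embedding_low_rank(labelings, k):
    """Return the k smallest eigenpairs of the normalized Laplacian.

    Assumption (not taken from the paper): similarity S = H H^T / r.
    Then D^{-1/2} S D^{-1/2} = B B^T with B = D^{-1/2} H / sqrt(r),
    so the eigenvectors of L = I - B B^T are the left singular vectors
    of the thin matrix B, with eigenvalues 1 - sigma^2.
    """
    H = indicator_matrix(labelings)
    r = len(labelings)
    # Degrees d = S 1, computed without forming S; d >= 1 for every object.
    d = H @ H.sum(axis=0) / r
    B = H / np.sqrt(r * d)[:, None]
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    # Singular values come sorted descending, so eigenvalues are ascending.
    return U[:, :k], 1.0 - s[:k] ** 2


if __name__ == "__main__":
    # Two base clusterings that agree up to a relabeling of the clusters.
    base = [[0, 0, 1, 1], [1, 1, 0, 0]]
    embedding, eigenvalues = spectral_embedding_low_rank(base, k=2)
    print(np.round(eigenvalues, 6))
```

Because the factor B has only m columns, the cost of the decomposition in this sketch grows with the number of base clusters rather than with the square of the number of documents. The rows of the returned embedding could then be passed to K-means to obtain a consensus partition.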
