The Heterogeneous Cluster Ensemble Method Using Hubness for Clustering Text Documents

We propose a cluster ensemble method that maps corpus documents into the semantic space embedded in Wikipedia and groups them using multiple types of feature spaces. A heterogeneous cluster ensemble is constructed from multiple types of relations, i.e., document-term, document-concept, and document-category. The final clustering solution is obtained by exploiting the associations between document pairs and the hubness of the documents. Empirical analysis on various real data sets shows that the proposed method outperforms state-of-the-art text clustering approaches.
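As a rough illustration of the two ingredients named above, pairwise document associations and hubness, the following sketch builds a co-association matrix from several base partitions and then counts how often each document appears among the nearest neighbours of the others. This is not the paper's algorithm: the abstract does not specify the consensus function, so the functions `co_association` and `hubness_scores`, the toy partitions, and the choice `k = 2` are all assumptions made for illustration.

```python
def co_association(partitions):
    """Fraction of base partitions in which each document pair shares a cluster.

    `partitions` is a list of label lists, one per feature space
    (hypothetical input format, not taken from the paper).
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def hubness_scores(similarity, k):
    """k-occurrence count: how many documents list each document
    among their k most similar neighbours (ties broken by index)."""
    n = len(similarity)
    scores = [0] * n
    for i in range(n):
        neighbours = [j for j in range(n) if j != i]
        neighbours.sort(key=lambda j: (-similarity[i][j], j))
        for j in neighbours[:k]:
            scores[j] += 1
    return scores


# Toy base clusterings of six documents, one per relation type (assumed data).
term_labels = [0, 0, 0, 1, 1, 1]      # document-term
concept_labels = [0, 0, 1, 1, 1, 1]   # document-concept
category_labels = [0, 0, 0, 0, 1, 1]  # document-category

assoc = co_association([term_labels, concept_labels, category_labels])
hubs = hubness_scores(assoc, k=2)
print(assoc[0])
print(hubs)
```

Documents with high hubness scores sit close to many others in the ensemble's view of the data, so a consensus step could plausibly treat them as cluster representatives; how the proposed method actually combines the two signals is described in the full paper, not here.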
