Selecting the Right Features for Bipartite-Based Text Clustering

Document datasets can be described with a bipartite graph in which terms and documents are modeled as vertices on the two sides. Partitioning such a graph yields a co-clustering of words and documents, with the expectation that each cluster's topic is captured by its top terms and documents. However, single terms alone are often insufficient to capture the semantics of documents. To address this, we propose employing hyperclique patterns of terms as additional features for document representation. We then use the F-score to select the most discriminative features for constructing the bipartite graph. Extensive experiments indicate that, compared with the standard bipartite formulation, our approach achieves better clustering performance with a smaller graph.
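
As a rough illustration of the representation described above, the following minimal sketch finds two-term hyperclique patterns and adds them as extra feature vertices on the term side of the bipartite graph. It assumes the usual definition of h-confidence (the support of the pattern divided by the largest support of any single term in it); the toy documents, the threshold of 0.6, the restriction to pairs, and all function names are hypothetical choices made for this example, not details taken from the paper. The F-score selection step is left out because the abstract does not specify how the score is computed.

```python
from itertools import combinations


def h_confidence(pattern, docs):
    """Support of the whole pattern divided by the largest single-term support."""
    joint = sum(1 for d in docs if pattern <= d)
    max_single = max(sum(1 for d in docs if t in d) for t in pattern)
    return joint / max_single if max_single else 0.0


def hyperclique_pairs(docs, min_hconf):
    """Return all term pairs whose h-confidence reaches the threshold.

    Only pairs are enumerated to keep the sketch short (an assumption);
    real hyperclique miners also grow longer patterns.
    """
    vocab = sorted(set().union(*docs))
    return [
        frozenset(pair)
        for pair in combinations(vocab, 2)
        if h_confidence(frozenset(pair), docs) >= min_hconf
    ]


def bipartite_edges(docs, patterns):
    """Build (feature, document index) edges for single terms and patterns."""
    edges = []
    for j, d in enumerate(docs):
        for term in sorted(d):
            edges.append((term, j))
        for p in patterns:
            if p <= d:  # the document contains every term of the pattern
                edges.append(("+".join(sorted(p)), j))
    return edges


# Toy corpus (hypothetical): each document is a set of terms.
docs = [
    {"graph", "partition", "spectral"},
    {"graph", "partition", "cut"},
    {"term", "document", "cluster"},
    {"term", "document", "graph"},
]

patterns = hyperclique_pairs(docs, min_hconf=0.6)
edges = bipartite_edges(docs, patterns)

for p in patterns:
    print(sorted(p), round(h_confidence(p, docs), 3))
```

In this toy corpus the pattern vertices link only documents that share a strongly associated term pair, which is the kind of additional signal the abstract attributes to hyperclique features. A feature-selection step would then prune the vertex set before the graph is partitioned.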
