Document Clustering Using Semantic Kernels Based on Term-Term Correlations

Document clustering algorithms usually use the vector space model (VSM) as their underlying document representation. The VSM assumes that terms are independent and therefore ignores any semantic relations between them. As a result, documents are mapped to a space in which the proximity between document vectors does not reflect their true semantic similarity. In this paper, we propose semantic kernels based on term-term correlations to improve the effectiveness of document clustering algorithms. These kernels measure the proximity between documents according to how their terms are statistically correlated. We analyze semantic kernels that capture different aspects of the correlations between terms and evaluate them experimentally on several benchmark data sets. The results show that the proposed method significantly improves document clustering compared with the VSM.
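To make the idea of a term-term correlation kernel concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes the kernel takes the form K = X S Xᵀ, where X is a document-term matrix and S is a term-term similarity matrix; here S is taken to be the Pearson correlation between term columns, which is one possible choice among the kernels the abstract alludes to. The function names, the toy vocabulary, and the toy counts are all hypothetical.

```python
import numpy as np

def term_correlation_kernel(X):
    """Return (K, S) for a document-term matrix X of shape (n_docs, n_terms).

    S is the Pearson correlation between term columns (an assumed choice),
    and K = X S X^T is the resulting document-document kernel.
    """
    X = np.asarray(X, dtype=float)
    # Center each term column so the covariance measures co-variation.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / max(X.shape[0] - 1, 1)
    # Guard against constant terms, whose standard deviation is zero.
    std = np.sqrt(np.diag(cov))
    std[std == 0] = 1.0
    S = cov / np.outer(std, std)
    # S is positive semidefinite, so K is a valid kernel matrix.
    K = X @ S @ X.T
    return K, S

def normalize_kernel(K):
    """Scale K so that entries behave like cosine similarities."""
    d = np.sqrt(np.clip(np.diag(K), 0.0, None))
    d[d == 0] = 1.0
    return K / np.outer(d, d)

# Hypothetical toy corpus; columns: car, automobile, engine, flower, petal.
X = np.array([
    [2, 0, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
])

K, S = term_correlation_kernel(X)
K_semantic = normalize_kernel(K)       # similarity under the sketched kernel
K_vsm = normalize_kernel(X @ X.T)      # plain VSM cosine similarity (S = I)
```

Setting S to the identity matrix recovers the plain VSM inner product, so the two similarity matrices can be compared directly, and either one can be passed to a clustering algorithm that accepts a precomputed similarity matrix.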
