Semantic Smoothing for Model-based Document Clustering

A document is often full of class-independent "general" words and short of class-specific "core " words, which leads to the difficulty of document clustering. We argue that both problems will be relieved after suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, and hence improve clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probability while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by a series of statistical translation language model for text retrieval, we propose in this paper a novel smoothing method referred to as context-sensitive semantic smoothing for document clustering purpose. The comparative experiment on three datasets shows that model-based clustering approaches with semantic smoothing is effective in improving cluster quality.

[1]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[2]  Ian Witten,et al.  Data Mining , 2000 .

[3]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[4]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[5]  Joydeep Ghosh,et al.  Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[6]  SmadjaFrank Retrieving collocations from text , 1993 .

[7]  G. Karypis,et al.  Criterion Functions for Document Clustering ∗ Experiments and Analysis , 2001 .

[8]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[9]  John D. Lafferty,et al.  Two-stage language models for information retrieval , 2002, SIGIR '02.

[10]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[11]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[12]  Frank Smadja,et al.  Retrieving Collocations from Text: Xtract , 1993, CL.

[13]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[14]  John D. Lafferty,et al.  Information retrieval as statistical translation , 1999, SIGIR '99.

[15]  John D. Lafferty,et al.  A study of smoothing methods for language models applied to Ad Hoc information retrieval , 2001, SIGIR '01.

[16]  Xiaohua Hu,et al.  Integration of semantic-based bipartite graph representation and mutual refinement strategy for biomedical literature clustering , 2006, KDD '06.

[17]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[18]  Xiaohua Hu,et al.  Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy , 2006, PAKDD.

[19]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[20]  Xiaohua Hu,et al.  Context-sensitive semantic smoothing for the language modeling approach to genomic IR , 2006, SIGIR.