Paper Information - Language Model-Based Document Clustering Using Random Walks

Language Model-Based Document Clustering Using Random Walks

We propose a new document vector representation specifically designed for the document clustering task. Instead of the traditional term-based vectors, a document is represented as an n-dimensional vector, where n is the number of documents in the cluster. The value at each dimension of the vector is closely related to the generation probability based on the language model of the corresponding document. Inspired by the recent graph-based NLP methods, we reinforce the generation probabilities by iterating random walks on the underlying graph representation. Experiments with k-means and hierarchical clustering algorithms show significant improvements over the alternative tf·idf vector representation.
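The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the function names (`generation_matrix`, `random_walk`), whitespace tokenization, interpolated smoothing with a hypothetical weight `lam`, per-word length normalization, and the use of a plain matrix power for the walk.

```python
import math
from collections import Counter


def generation_matrix(docs, lam=0.5):
    """Build an n x n matrix whose row i is the document vector of docs[i].

    Entry (i, j) is the length-normalized probability of generating
    docs[i] from a smoothed unigram model of docs[j], with each row
    rescaled to sum to 1. Assumes every document is non-empty.
    """
    tokenized = [d.lower().split() for d in docs]
    counts = [Counter(tokens) for tokens in tokenized]

    # Background (whole-collection) word counts used for smoothing.
    corpus = Counter()
    for c in counts:
        corpus.update(c)
    total = sum(corpus.values())

    matrix = []
    for tokens_i in tokenized:
        row = []
        for j, tokens_j in enumerate(tokenized):
            log_p = 0.0
            for w in tokens_i:
                p_doc = counts[j][w] / len(tokens_j)  # 0 if w is absent from doc j
                p_bg = corpus[w] / total              # always > 0 for words in the collection
                log_p += math.log(lam * p_doc + (1 - lam) * p_bg)
            # Geometric mean per word, so long documents are not penalized.
            row.append(math.exp(log_p / len(tokens_i)))
        row_sum = sum(row)
        matrix.append([v / row_sum for v in row])
    return matrix


def random_walk(matrix, steps):
    """Return the steps-th power of a row-stochastic matrix.

    Entry (i, j) is the probability of reaching document j from
    document i in exactly `steps` steps (assumed walk formulation).
    """
    n = len(matrix)
    result = matrix
    for _ in range(steps - 1):
        result = [
            [sum(result[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result
```

Each row of the returned matrix is an n-dimensional document vector that could be passed to k-means or hierarchical clustering in place of a tf·idf vector. Because the input matrix is row-stochastic, every row of the walked matrix still sums to 1.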

Günes Erkan

[1] J. Munkres. Algorithms for the Assignment and Transportation Problems, 1957.

[2] Valerie Isham, et al. Non-Negative Matrices and Markov Chains, 1983.

[3] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[4] Sergey Brin, et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998, Comput. Networks.

[5] Duncan J. Watts, et al. Collective dynamics of ‘small-world’ networks, 1998, Nature.

[6] Chinatsu Aone, et al. Fast and effective text mining using linear-time document clustering, 1999, KDD '99.

[7] Naftali Tishby, et al. Document clustering using word clusters via the information bottleneck method, 2000, SIGIR '00.

[8] Chris H. Q. Ding, et al. Bipartite graph partitioning and data clustering, 2001, CIKM '01.

[9] Stan Matwin, et al. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, 2001.

[10] David Harel, et al. Clustering spatial data using random walks, 2001, KDD '01.

[11] Inderjit S. Dhillon, et al. Co-clustering documents and words using bipartite spectral graph partitioning, 2001, KDD '01.