Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, documents are represented as vectors in which each dimension corresponds to a word. We propose instead a document representation based on the most frequent character N-grams, with a window size of up to 10 characters. We derive a new distance measure that produces uniformly better results than the word-based and term-based methods. The result is all the more significant given the robustness of the N-gram method, which requires no language-dependent preprocessing. Experiments measuring the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both the word and the term representations. The comparison between the word and term representations themselves depends on the data set and the selected dimensionality.
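As an illustration of the representation described above, the sketch below builds a profile of a document's most frequent character trigrams (n=3, the best-performing setting reported) and compares documents with cosine distance. This is a minimal, assumed implementation: the paper derives its own distance measure, and the profile size (`top_k`) here is an arbitrary illustrative choice.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3, top_k=100):
    """Count character n-grams and keep only the most frequent ones
    as the document's feature vector (a dict: n-gram -> count)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(grams.most_common(top_k))

def cosine_distance(p, q):
    """Cosine distance between two sparse n-gram frequency profiles.
    (Illustrative stand-in; the paper derives its own measure.)"""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)

a = ngram_profile("document clustering with character n-grams")
b = ngram_profile("clustering documents using character n-grams")
c = ngram_profile("completely unrelated text about cooking pasta")

# Documents on the same topic share many trigrams, so their distance
# is smaller than the distance to an unrelated document.
assert cosine_distance(a, b) < cosine_distance(a, c)
```

Note that, unlike word-based features, nothing here depends on stemming, stop-word lists, or any other language-specific preprocessing, which is the robustness property the abstract highlights.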
