Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, documents are represented as vectors in which each dimension corresponds to a word. We propose instead a document representation based on the most frequent character N-grams, with a window size of up to 10 characters. We derive a new distance measure that produces uniformly better results than the word-based and term-based methods. The result is all the more significant given the robustness of the N-gram method, which requires no language-dependent preprocessing. Experiments measuring the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both the word and the term representations. The comparison between the word and term representations themselves depends on the data set and the selected dimensionality.
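As an illustration of the representation described above, the sketch below builds a profile of a document's most frequent character trigrams (n=3, the best-performing setting reported) and compares documents with cosine distance. This is a minimal, assumed implementation: the paper derives its own distance measure, and the profile size (`top_k`) here is an arbitrary illustrative choice.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3, top_k=100):
    """Count character n-grams and keep only the most frequent ones
    as the document's feature vector (a dict: n-gram -> count)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(grams.most_common(top_k))

def cosine_distance(p, q):
    """Cosine distance between two sparse n-gram frequency profiles.
    (Illustrative stand-in; the paper derives its own measure.)"""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)

a = ngram_profile("document clustering with character n-grams")
b = ngram_profile("clustering documents using character n-grams")
c = ngram_profile("completely unrelated text about cooking pasta")

# Documents on the same topic share many trigrams, so their distance
# is smaller than the distance to an unrelated document.
assert cosine_distance(a, b) < cosine_distance(a, c)
```

Note that, unlike word-based features, nothing here depends on stemming, stop-word lists, or any other language-specific preprocessing, which is the robustness property the abstract highlights.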
