Text document clustering based on frequent word meaning sequences

Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. Word sequences in the documents are therefore ignored, even though the meaning of natural language depends strongly on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form appearing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, the CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS achieve much better clustering accuracy than the Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544], and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.
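The definition of a frequent word (meaning) sequence can be illustrated with a small sketch. It is not the CFWS or CFWMS algorithm, whose mining procedure the abstract does not describe. It assumes contiguous sequences of a fixed length `n`, whitespace tokenization, and a hand-written synonym table standing in for a real lexical resource; the names `frequent_sequences`, `concept_of`, and `min_support` are hypothetical.

```python
from collections import Counter


def frequent_sequences(docs, n=2, min_support=0.5, concept_of=None):
    """Return {sequence: support} for length-n sequences that occur in
    more than `min_support` (a fraction) of the documents.

    Assumptions for illustration only: sequences are contiguous, tokens
    are split on whitespace, and `concept_of` is a toy word-to-concept map.
    """
    concept_of = concept_of or {}
    doc_counts = Counter()
    for doc in docs:
        # Replace each word form by its concept label when one is known.
        tokens = [concept_of.get(w, w) for w in doc.lower().split()]
        # Use a set so a sequence is counted once per document.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(seen)
    total = len(docs)
    return {seq: c / total for seq, c in doc_counts.items() if c / total > min_support}


docs = [
    "stock prices rise sharply",
    "share prices rise again",
    "stock prices fall",
    "oil exports rise",
]
# Toy synonym table (assumption): both word forms map to one concept.
synonyms = {"stock": "EQUITY", "share": "EQUITY"}

word_level = frequent_sequences(docs, n=2, min_support=0.5)
meaning_level = frequent_sequences(docs, n=2, min_support=0.5, concept_of=synonyms)
print(word_level)     # {}
print(meaning_level)  # {('EQUITY', 'prices'): 0.75}
```

In this toy corpus no word-form pair occurs in more than half of the documents, but once "stock" and "share" are mapped to the same concept, the sequence (EQUITY, prices) appears in three of four documents and becomes frequent. This is the kind of effect that separates word meaning sequences from plain word sequences.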
