A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Text document clustering offers an effective and intuitive way to navigate large sets of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods represent documents in a vector space built from a long list of words, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high; second, it ignores important relationships between terms, such as synonymy and antonymy. Our unsupervised method addresses both problems by using ANNIE together with WordNet lexical categories and the WordNet ontology to build a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we chose bisecting k-means for its accuracy and the Multipole tree, a modified version of the Antipole tree data structure, for its speed.
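
The idea of replacing a long word list with a handful of lexical categories, then clustering in that smaller space, can be illustrated with a minimal sketch. It is not the implementation described here: the dictionary `TOY_CATEGORIES` is a hypothetical hand-written stand-in for a WordNet lookup (the category names follow WordNet's lexicographer-file naming, but the word assignments are assumptions and ignore word sense), the helper names are invented for illustration, and the bisecting k-means shown is a plain textbook variant. ANNIE and the Multipole tree are not modelled.

```python
import math
import random
from collections import Counter

# Hypothetical toy lexicon (an assumption, not real WordNet output):
# each known word maps to one lexical category.
TOY_CATEGORIES = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "bank": "noun.possession", "money": "noun.possession", "loan": "noun.possession",
    "run": "verb.motion", "walk": "verb.motion",
    "pay": "verb.possession", "buy": "verb.possession",
}
CATEGORIES = sorted(set(TOY_CATEGORIES.values()))  # one dimension per category


def category_vector(text):
    """Map a document to a unit-length vector of lexical-category counts."""
    counts = Counter(
        TOY_CATEGORIES[w] for w in text.lower().split() if w in TOY_CATEGORIES
    )
    vec = [float(counts[c]) for c in CATEGORIES]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(vectors, indices, rng, iters=10):
    """Split one cluster (a list of document indices) into two with 2-means."""
    first = rng.choice(indices)
    # Second seed: the document farthest from the first seed.
    second = max(indices, key=lambda i: sq_dist(vectors[i], vectors[first]))
    centers = [vectors[first], vectors[second]]
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for i in indices:
            d0 = sq_dist(vectors[i], centers[0])
            d1 = sq_dist(vectors[i], centers[1])
            parts[0 if d0 <= d1 else 1].append(i)
        if not parts[0] or not parts[1]:
            break  # degenerate split, e.g. all vectors identical
        centers = [centroid([vectors[i] for i in p]) for p in parts]
    return parts


def bisecting_kmeans(vectors, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        left, right = two_means(vectors, biggest, rng)
        if not left or not right:
            clusters.append(biggest)  # cannot split further
            break
        clusters += [left, right]
    return clusters


if __name__ == "__main__":
    docs = [
        "the dog and the cat run",
        "a horse can walk and run",
        "the bank will pay the loan",
        "buy with money from the bank",
    ]
    vectors = [category_vector(d) for d in docs]
    print(len(CATEGORIES), "dimensions:", CATEGORIES)
    print(bisecting_kmeans(vectors, k=2))
```

In this toy setting every document becomes a four-dimensional vector regardless of vocabulary size, and words such as "dog" and "horse" contribute to the same dimension even though they never co-occur, which is the effect the category-based vector space is meant to achieve.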
