Paper information - A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Text document clustering provides an effective and intuitive navigation mechanism for organizing large sets of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonymy or antonymy. Our unsupervised method addresses both problems by using ANNIE, WordNet lexical categories, and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we chose bisecting k-means for its accuracy and the Multipole tree, a modified version of the Antipole tree data structure, for its speed.
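To make the pipeline in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes a tiny hand-written lexicon (`TOY_LEXICON`, hypothetical) in place of real WordNet lexical categories, skips the ANNIE and ontology steps entirely, and implements only bisecting k-means; the Multipole tree is not shown. All function names are illustrative.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet lexical categories
# (labels styled after lexicographer-file names); not real WordNet output.
TOY_LEXICON = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "bread": "noun.food", "cheese": "noun.food", "soup": "noun.food",
    "run": "verb.motion", "jump": "verb.motion",
    "eat": "verb.consumption", "drink": "verb.consumption",
}
CATEGORIES = sorted(set(TOY_LEXICON.values()))


def category_vector(text):
    """Map a document to normalized counts over the lexical categories."""
    counts = Counter(TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON)
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in CATEGORIES]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sse(vectors, members):
    """Sum of squared distances of the member vectors to their centroid."""
    c = centroid([vectors[i] for i in members])
    return sum(sq_dist(vectors[i], c) for i in members)


def two_means(vectors, iters=20):
    """Split vectors in two with 2-means; returns two lists of local indices.

    Seeds are chosen deterministically: the point farthest from the centroid,
    then the point farthest from that one. Assumes at least two distinct points.
    """
    c = centroid(vectors)
    a = max(vectors, key=lambda v: sq_dist(v, c))
    b = max(vectors, key=lambda v: sq_dist(v, a))
    centers, groups = [a, b], None
    for _ in range(iters):
        new = ([], [])
        for i, v in enumerate(vectors):
            nearer = 0 if sq_dist(v, centers[0]) <= sq_dist(v, centers[1]) else 1
            new[nearer].append(i)
        if not new[0] or not new[1] or new == groups:
            break  # degenerate split or converged: keep the last valid groups
        groups = new
        centers = [centroid([vectors[i] for i in g]) for g in groups]
    return groups


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        target = max(clusters, key=lambda m: sse(vectors, m))
        if sse(vectors, target) < 1e-12:
            break  # every remaining cluster is a single point (or duplicates)
        clusters.remove(target)
        left, right = two_means([vectors[i] for i in target])
        clusters.append([target[i] for i in left])
        clusters.append([target[i] for i in right])
    return clusters


docs = ["dog cat horse", "cat dog run", "bread cheese soup", "eat bread cheese"]
vectors = [category_vector(d) for d in docs]
clusters = bisecting_kmeans(vectors, k=2)
```

Each document here becomes a four-dimensional vector (one dimension per category) rather than one dimension per distinct word, which is the dimensionality reduction the abstract refers to. Words in the same category, such as "dog" and "horse", contribute to the same dimension.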

Author: Diego Reforgiato Recupero (also listed as D. Recupero and D. R. Recupero)
