WordNet improves text document clustering

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. The bag-of-words representation used for these clustering methods is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate background knowledge (in our application, WordNet) into the process of clustering text documents. We cluster the documents with a standard partitional algorithm. Our experimental evaluation on Reuters newsfeeds compares the clustering results with the pre-categorizations of the news. In these experiments, background knowledge improves results over the baseline for many interesting tasks.
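The limitation of the bag-of-words representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon `TOY_CONCEPTS`, the helper names, and the two example sentences are hypothetical, and the hand-written term-to-concept mapping only stands in for a real WordNet lookup. The sketch shows that two documents with no literal term overlap have zero cosine similarity, and that appending shared concept features makes them comparable.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet (term -> concept label).
# These mappings are illustrative assumptions, not real WordNet synsets.
TOY_CONCEPTS = {
    "beef": "MEAT",
    "pork": "MEAT",
    "wheat": "GRAIN",
    "corn": "GRAIN",
}


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def add_concepts(term_counts, lexicon=TOY_CONCEPTS):
    """Return a copy of the term vector with concept counts added to it."""
    enriched = Counter(term_counts)
    for term, count in term_counts.items():
        concept = lexicon.get(term)
        if concept is not None:
            enriched["concept:" + concept] += count
    return enriched


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("beef exports rise")
doc_b = bag_of_words("pork shipments fall")

# No token is shared, so the plain bag-of-words similarity is zero.
plain_similarity = cosine(doc_a, doc_b)

# Both documents gain the shared feature "concept:MEAT".
enriched_similarity = cosine(add_concepts(doc_a), add_concepts(doc_b))

print(plain_similarity, enriched_similarity)
```

A partitional clustering algorithm that relies on such similarities could then place the two documents in the same cluster, which the plain term vectors give it no reason to do.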

Steffen Staab | Andreas Hotho | Gerd Stumme
