Paper information - USING CONCEPT RELATIONSHIPS TO IMPROVE DOCUMENT CATEGORIZATION

USING CONCEPT RELATIONSHIPS TO IMPROVE DOCUMENT CATEGORIZATION

In the information age, we depend heavily on our ability to find information hidden in mostly unstructured textual documents. This article proposes a simple method that, as an addition to existing systems, improves categorization accuracy over traditional techniques without requiring any time-consuming or language-dependent computation. The improvement is achieved by exploiting properties observed in the entire document collection rather than in individual documents, which may also be regarded as constructing an approximate concept network (measuring semantic distances). These properties are sufficiently simple to avoid entailing massive computations; however, they try to capture the most fundamental relationships between words or concepts. Experiments performed on the Reuters-21578 news article collection were evaluated using a set of simple measurements estimating clustering efficiency and, in addition, by Cluto, a widely used document clustering software package. Results show a 5–10% improvement in clustering quality over traditional tf (term frequency) or tf×idf (term frequency-inverse document frequency) based clustering.

Hassan Charaf | Péter Schönhofen
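
For readers unfamiliar with the tf×idf baseline that the abstract compares against, the following is a minimal sketch of that weighting and of a cosine similarity between the resulting document vectors. It illustrates only the traditional baseline, not the concept-relationship method proposed in the article. The function names `tf_idf` and `cosine`, the whitespace tokenizer, the particular idf variant log(N / df), and the three sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw tf times log(N / df)).

    Assumption: whitespace tokenization and this idf variant are chosen for
    illustration only; the article does not specify them here.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, not taken from Reuters-21578.
docs = [
    "oil prices rise as crude supply falls",
    "crude oil supply falls again",
    "bank raises interest rates",
]
vectors = tf_idf(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this sketch the two oil-related sentences share weighted terms and therefore receive a positive similarity, while the first and third sentences share no terms and score zero. Documents that discuss the same topic with different vocabulary would also score zero under this baseline, which is the kind of gap that collection-level word relationships are meant to address.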
