Paper information - Identifying document topics using the Wikipedia category network


In the last few years, the size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used much like an ontology or taxonomy to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves from their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.
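
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of using article titles and categories alone, not the authors' method. It assumes a lookup table from article titles to categories is available. The toy table `TITLE_TO_CATEGORIES` and the function names `tokenize` and `rank_categories` are hypothetical and chosen for illustration. The sketch finds word sequences in a document that match article titles and lets each match vote for that article's categories.

```python
import re
from collections import Counter

# Hypothetical toy mapping from lowercased article titles to categories.
# A real system would build this from a Wikipedia dump; these entries
# are invented for illustration only.
TITLE_TO_CATEGORIES = {
    "neural network": ["Machine learning", "Artificial intelligence"],
    "support vector machine": ["Machine learning", "Classification algorithms"],
    "ice hockey": ["Team sports", "Winter sports"],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_categories(text, title_to_categories, max_title_len=4):
    """Rank categories by how many matched titles in the text vote for them.

    Every word n-gram of up to max_title_len tokens is looked up as an
    article title; each hit adds one vote to each of that article's
    categories. This plain vote count is an assumption of the sketch.
    """
    tokens = tokenize(text)
    votes = Counter()
    for n in range(1, max_title_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            for category in title_to_categories.get(phrase, ()):
                votes[category] += 1
    # Highest vote counts first.
    return votes.most_common()


if __name__ == "__main__":
    doc = "A neural network and a support vector machine were compared."
    for category, count in rank_categories(doc, TITLE_TO_CATEGORIES):
        print(category, count)
```

The resulting category counts could then stand in for the document's text as features in classification or clustering, as in the experiments the abstract describes. Any weighting or filtering of the votes would be a further design choice that this sketch leaves out.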

Peter Sch
