Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia

Wikipedia, a very large Web-based encyclopedia, has attracted considerable attention as a corpus for knowledge extraction because of several useful characteristics: a huge number of articles, live updates, a dense link structure, brief anchor texts, and URL identification for concepts. We have previously shown that Wikipedia can be used to construct a large-scale, accurate association thesaurus. The association thesaurus we constructed covers almost 1.3 million concepts, and its accuracy was confirmed in detailed experiments. However, scalable methods are still needed to analyze the huge number of Web pages and the hyperlinks among articles in the Web-based encyclopedia. In this paper, we propose a scalable method for constructing an association thesaurus from Wikipedia based on link co-occurrences. Link co-occurrence analysis is more scalable than link structure analysis because it is a one-pass process. We also propose a method that integrates tf-idf with link co-occurrence analysis. Experimental results show that both proposed methods are more accurate and more scalable than conventional methods. Furthermore, integrating tf-idf achieved higher accuracy than using link co-occurrences alone.
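
To illustrate why link co-occurrence analysis can run as a one-pass process, the following is a minimal sketch, not the method evaluated in the paper. It assumes each article is available as a list of outgoing link targets, and it uses a cosine-style normalization as the relatedness score. The function names `link_cooccurrence` and `relatedness`, the input format, and the scoring formula are all illustrative assumptions; the abstract does not specify them, and the tf-idf integration is not shown.

```python
from collections import Counter
from itertools import combinations
import math


def link_cooccurrence(articles):
    """Count link frequencies and pairwise co-occurrences in one pass.

    `articles` is an assumed input format: a dict mapping an article
    title to the list of link targets (concepts) found in that article.
    """
    link_freq = Counter()  # number of articles containing each link target
    pair_freq = Counter()  # number of articles containing both targets of a pair
    for links in articles.values():
        unique = sorted(set(links))  # ignore repeated links within one article
        link_freq.update(unique)
        pair_freq.update(combinations(unique, 2))  # pairs are stored in sorted order
    return link_freq, pair_freq


def relatedness(a, b, link_freq, pair_freq):
    """Cosine-style score (an assumption, not the paper's formula)."""
    key = (a, b) if a < b else (b, a)
    co = pair_freq.get(key, 0)
    if co == 0:
        return 0.0
    return co / math.sqrt(link_freq[a] * link_freq[b])


# Hypothetical toy corpus of three articles.
articles = {
    "A": ["x", "y", "z"],
    "B": ["x", "y"],
    "C": ["x"],
}
link_freq, pair_freq = link_cooccurrence(articles)
score_xy = relatedness("x", "y", link_freq, pair_freq)
```

Each article is visited once and only local pair counts are updated, so no traversal of the global link graph is required. Note that the number of pairs grows quadratically with the number of links per article, so a practical implementation would likely cap or filter the links considered per article.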
