A Thesaurus Construction Method from Large Scale Web Dictionaries

Web-based dictionaries such as Wikipedia have become dramatically popular among Internet users over the past several years. The important characteristic of a Web-based dictionary is not only its huge number of articles but also its hyperlinks. Hyperlinks carry a variety of information beyond simply providing navigation between pages. In this paper, we propose an efficient method that analyzes the link structure of Web-based dictionaries to construct an association thesaurus. We have applied it to Wikipedia, a huge-scale Web-based dictionary with a dense link structure, as a corpus. We developed a search engine for evaluation and conducted a number of experiments to compare our method with traditional methods such as co-occurrence analysis.
