Exploiting Locality of Wikipedia Links in Entity Ranking

Information retrieval from web and XML document collections is increasingly focused on returning entities instead of web pages or XML elements. There are many research fields involving named entities; one such field is known as entity ranking, where one goal is to rank entities in response to a query supported by a short list of entity examples. In this paper, we describe our approach to ranking entities from the Wikipedia XML document collection. Our approach utilises the known categories and the link structure of Wikipedia and, more importantly, exploits link co-occurrences to improve the effectiveness of entity ranking. Using the broad context of a full Wikipedia page as a baseline, we evaluate two different algorithms for identifying narrow contexts around the entity examples: one that uses predefined types of elements such as paragraphs, lists and tables; and another that dynamically identifies the contexts by utilising the underlying XML document structure. Our experiments demonstrate that the locality of Wikipedia links can be exploited to significantly improve the effectiveness of entity ranking.
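
As an illustration only, the contrast between a broad (full-page) context and narrow, predefined contexts can be sketched as below. This is a minimal sketch under stated assumptions, not the algorithms evaluated in the paper: the element names `p`, `list`, `table` and `link`, the `target` attribute, and the function `cooccurrence_counts` are all hypothetical, and the score is a plain co-occurrence count rather than any ranking formula from this work.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed element names for the predefined narrow contexts (hypothetical).
NARROW_CONTEXT_TAGS = {"p", "list", "table"}


def link_targets(element):
    """Return the set of link targets found anywhere inside `element`.

    Assumes links are <link target="..."/> elements (hypothetical markup).
    """
    return {
        link.get("target")
        for link in element.iter("link")
        if link.get("target")
    }


def cooccurrence_counts(page_xml, examples, narrow=True):
    """Count candidate links that share a context with any entity example.

    narrow=True  -> each paragraph, list or table is its own context
    narrow=False -> the whole page is a single (broad) context
    Nested contexts are each counted separately in this sketch.
    """
    root = ET.fromstring(page_xml)
    if narrow:
        contexts = [el for el in root.iter() if el.tag in NARROW_CONTEXT_TAGS]
    else:
        contexts = [root]

    counts = Counter()
    for context in contexts:
        targets = link_targets(context)
        if targets & examples:  # the context mentions an entity example
            counts.update(targets - examples)  # credit co-occurring candidates
    return counts


PAGE = """
<article>
  <p>Painters such as <link target="Monet"/> and <link target="Renoir"/>.</p>
  <p>The museum is in <link target="Paris"/>.</p>
  <list>
    <item><link target="Monet"/></item>
    <item><link target="Degas"/></item>
  </list>
</article>
"""

examples = {"Monet"}
narrow_counts = cooccurrence_counts(PAGE, examples, narrow=True)
broad_counts = cooccurrence_counts(PAGE, examples, narrow=False)
print("narrow:", dict(narrow_counts))
print("broad: ", dict(broad_counts))
```

In this toy page, the broad context credits every link on the page, including `Paris`, whereas the narrow contexts credit only links that appear in the same paragraph or list as the example `Monet`.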
