The Importance of Link Evidence in Wikipedia

Wikipedia is one of the most popular information sources on the Web, and the free encyclopedia is densely linked. Its link structure differs from that of the Web at large: internal links in Wikipedia are typically based on words naturally occurring in a page, and they point to another semantically related entry. Our main aim is to find out whether Wikipedia's link structure can be exploited to improve ad hoc information retrieval. We first analyse the relation between Wikipedia links and the relevance of pages. We then experiment with the use of link evidence in the focused retrieval of Wikipedia content, based on the INEX 2006 test collection. Our main findings are twofold. First, our analysis reveals that the Wikipedia link structure is a (possibly weak) indicator of relevance. Second, our experiments on INEX ad hoc retrieval tasks reveal that making the link evidence sensitive to the local context yields a significant improvement in retrieval effectiveness. Hence, in contrast with earlier TREC experiments using crawled Web data, we show that Wikipedia's link structure can help improve the effectiveness of ad hoc retrieval.
