GPX: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia

The INEX 2007 evaluation was based on the Wikipedia collection. In this paper we describe some modifications to the GPX search engine and the approach taken in the Ad-hoc and Link-the-Wiki tracks. In earlier versions of GPX, scores were recursively propagated from text-containing nodes, through their ancestors, all the way to the document root of the XML tree. Here we describe a simplification whereby the score of each node is computed directly, doing away with the score propagation mechanism. Results indicate slightly improved performance. The GPX search engine was also used in the Link-the-Wiki track to identify prospective incoming links to new Wikipedia pages. In addition, we describe a simple and efficient approach to identifying prospective outgoing links in new Wikipedia pages. We present and discuss the evaluation results.
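
To make the contrast between the two scoring schemes concrete, the following is a minimal sketch on a toy XML tree. It is not the GPX scoring formula: the term-count score, the `decay` parameter, and the function names `own_hits`, `propagated_score`, and `direct_score` are all assumptions introduced for illustration only.

```python
import xml.etree.ElementTree as ET


def own_hits(node, terms):
    """Count query-term occurrences in text held directly by this node
    (its own text plus the tails of its children). Toy score, an assumption."""
    chunks = [node.text or ""] + [child.tail or "" for child in node]
    words = " ".join(chunks).lower().split()
    return sum(words.count(t) for t in terms)


def propagated_score(node, terms, decay=1.0):
    """Recursive scheme: a node's score is its own hits plus the (optionally
    decayed) scores propagated up from its children."""
    children_total = sum(propagated_score(c, terms, decay) for c in node)
    return own_hits(node, terms) + decay * children_total


def direct_score(node, terms):
    """Direct scheme: score the node from all text it contains, with no
    propagation step between tree levels."""
    words = " ".join(node.itertext()).lower().split()
    return sum(words.count(t) for t in terms)


# Hypothetical document and query, not taken from the INEX collection.
doc = ET.fromstring(
    "<article><title>xml retrieval</title>"
    "<sec><p>xml search engine</p><p>link discovery</p></sec></article>"
)
terms = ["xml", "link"]

for element in doc.iter():
    print(element.tag,
          propagated_score(element, terms, decay=0.5),
          direct_score(element, terms))
```

In this toy setting the two schemes agree when `decay` is 1.0 and diverge otherwise; the direct variant needs only the text under each node, which is the kind of simplification the abstract refers to.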
