Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction

Entity ranking has recently emerged as a research field that aims at retrieving entities as answers to a query. Unlike entity extraction, where the goal is to tag names of entities in documents, entity ranking focuses on returning a ranked list of relevant entity names for the query. Many approaches to entity ranking have been proposed, and most of them were evaluated on the INEX Wikipedia test collection. In this paper, we describe a system we developed for ranking Wikipedia entities in answer to a query. The entity ranking approach implemented in our system utilises the known categories, the link structure of Wikipedia, and the link co-occurrences with the entity examples (when provided) to retrieve relevant entities as answers to the query. We also extend our entity ranking approach by utilising the knowledge of predicted classes of topic difficulty. To predict topic difficulty, we build a classifier that uses features extracted from an INEX topic definition to assign the topic to one of a set of experimentally pre-determined classes. This knowledge is then used to dynamically set the optimal values for the retrieval parameters of our entity ranking system. Our experiments demonstrate that the use of categories and the link structure of Wikipedia can significantly improve entity ranking effectiveness, and that topic difficulty prediction is a promising approach that could be exploited to further improve entity ranking performance.
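
The abstract does not give the scoring formula, so the following is only a minimal sketch of how the pieces might fit together. It assumes a linear combination of three evidence scores (links, categories, full text) and a lookup table that maps a predicted difficulty class to retrieval parameters. The class names, the weight values, and every identifier (`Evidence`, `WEIGHTS_BY_CLASS`, `combined_score`, `rank_entities`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Per-entity evidence scores, each assumed to be normalised to [0, 1]."""
    link: float      # link structure and co-occurrence with entity examples
    category: float  # overlap between entity categories and target categories
    text: float      # full-text retrieval score of the entity page


# Hypothetical (alpha, beta) weights per predicted difficulty class.
# These values are illustrative only; they are not taken from the paper.
WEIGHTS_BY_CLASS = {
    "easy": (0.1, 0.8),
    "difficult": (0.7, 0.1),
}


def combined_score(e: Evidence, alpha: float, beta: float) -> float:
    """Assumed linear combination: alpha*link + beta*category + rest*text."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * e.link + beta * e.category + (1 - alpha - beta) * e.text


def rank_entities(candidates: dict[str, Evidence], difficulty_class: str) -> list[str]:
    """Rank entity names, best first, using the weights for the predicted class."""
    alpha, beta = WEIGHTS_BY_CLASS[difficulty_class]
    return sorted(
        candidates,
        key=lambda name: combined_score(candidates[name], alpha, beta),
        reverse=True,
    )


candidates = {
    "A": Evidence(link=0.9, category=0.1, text=0.2),
    "B": Evidence(link=0.1, category=0.9, text=0.2),
}
ranking_easy = rank_entities(candidates, "easy")
ranking_difficult = rank_entities(candidates, "difficult")
```

In this toy setting the same candidates are ordered differently depending on the predicted class, which is the effect the dynamic parameter setting is meant to achieve. In a real system the class would come from the trained classifier rather than being passed in by hand.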
