Ranking Entities for Web Queries Through Text and Knowledge

When humans explain complex topics, they naturally talk about the entities involved, such as people, locations, or events. In this paper, we aim to automate this process by retrieving and ranking entities that are relevant for understanding free-text, web-style queries like "Argentine British relations". Such queries typically demand a set of heterogeneous entities with no specific target type as the answer, for instance Falklands_War or Margaret_Thatcher. Standard approaches to entity retrieval rely purely on features from the knowledge base. We approach the problem from the opposite direction, namely by analyzing web documents that are found to be query-relevant. Our approach hinges on entity linking technology that identifies entity mentions and links them to a knowledge base like Wikipedia. We use a learning-to-rank approach and study different features that use documents, entity mentions, and knowledge base entities, thus bridging document and entity retrieval. Since established benchmarks for this problem do not exist, we use TREC test collections for document ranking and collect custom relevance judgments for entities. Experiments on TREC Robust04 and TREC Web13/14 data show that: i) single entity features, like the frequency of occurrence within the top-ranked documents or the query retrieval score against a knowledge base, generally perform well; ii) the best overall performance is achieved when combining different features that relate an entity to the query, its document mentions, and its knowledge base representation.
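As a rough illustration of the simplest feature mentioned above (how often an entity occurs within the top-ranked documents), the following minimal sketch counts linked entity mentions over the top k retrieved documents and sorts entities by that count. The function name, the input layout (one list of entity IDs per document, already in rank order), and the toy data are assumptions made for illustration only; they are not taken from the paper, and entity linking and document retrieval are assumed to have happened upstream.

```python
from collections import Counter


def rank_entities_by_mention_frequency(ranked_docs, k=10):
    """Rank entities by how often they are mentioned in the top-k documents.

    ranked_docs: hypothetical input, one list of linked entity IDs per
    document, ordered by document retrieval rank (best first).
    """
    counts = Counter()
    for doc_entities in ranked_docs[:k]:
        counts.update(doc_entities)
    # Sort by descending count; break ties alphabetically for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy example (made-up mentions, not data from the paper).
top_docs = [
    ["Falklands_War", "Margaret_Thatcher", "Falklands_War"],
    ["Falklands_War", "Argentina"],
    ["Margaret_Thatcher", "United_Kingdom"],
]
ranking = rank_entities_by_mention_frequency(top_docs, k=3)
for entity, count in ranking:
    print(entity, count)
```

In a learning-to-rank setting, a count like this would be only one feature among several, combined with others such as a knowledge base retrieval score rather than used as the final ranking on its own.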
