Features and Aggregators for Web-scale Entity Search

We focus on two research issues in entity search: scoring a document or snippet that potentially supports a candidate entity, and aggregating scores from different snippets into an entity score. Proximity scoring has been studied in IR outside the scope of entity search. However, aggregation has been hardwired except in a few cases where probabilistic language models are used. We instead explore simple, robust, discriminative ranking algorithms, with informative snippet features and broad families of aggregation functions. Our first contribution is a study of proximity-cognizant snippet features. In contrast with prior work, which uses hardwired "proximity kernels" that implement a fixed decay with distance, we present a "universal" feature encoding which jointly expresses the perplexity (informativeness) of a query-term match and the proximity of the match to the entity mention. Our second contribution is a study of aggregation functions. Rather than train the ranking algorithm on snippets and then aggregate scores, we directly train on entities, so that the ranking algorithm takes the aggregation function into account. Our third contribution is an extensive Web-scale evaluation of the above algorithms on two datasets with quite different properties and behavior. The first is the W3C dataset used in TREC-scale enterprise search, with pre-annotated entity mentions. The second is a Web-scale open-domain entity search dataset consisting of 500 million Web pages, which contain about 8 billion token spans annotated automatically with two million entities from 200,000 entity types in Wikipedia. On the TREC dataset, the performance of our system is comparable to that of the best current systems. On the much larger and noisier Web dataset, our system delivers significantly better performance than all other systems, with an 8% MAP improvement over the closest competitor.
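As a concrete illustration of the two ideas above, the sketch below builds a snippet feature vector indexed jointly by the proximity of a query-term match to the entity mention and by the match's informativeness (IDF bucket), then aggregates per-snippet scores into an entity score. This is a minimal sketch under stated assumptions, not the paper's exact encoding or trained aggregation family: the bucket boundaries, the linear scoring model, and the soft-max aggregator are all hypothetical choices made for illustration.

```python
# Minimal sketch (illustrative only) of a joint proximity/informativeness
# snippet feature encoding and a smooth score aggregator for entity search.
# Bucket boundaries, the linear model, and the soft-max aggregator are
# assumptions, not the paper's actual configuration.

import math
import numpy as np

DIST_BUCKETS = [1, 2, 4, 8, 16, 32]   # hypothetical proximity buckets (token distance)
IDF_BUCKETS = [1.0, 3.0, 6.0, 10.0]   # hypothetical informativeness (IDF) buckets

def bucket(value, edges):
    """Index of the first bucket whose upper edge is >= value."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

def snippet_features(match_offsets, mention_offset, idfs):
    """Joint (proximity x informativeness) feature vector for one snippet.

    match_offsets : token positions of query-term matches in the snippet
    mention_offset: token position of the candidate entity mention
    idfs          : IDF value of the matched query term at each position
    """
    width = len(IDF_BUCKETS) + 1
    phi = np.zeros((len(DIST_BUCKETS) + 1) * width)
    for pos, idf in zip(match_offsets, idfs):
        d = bucket(abs(pos - mention_offset), DIST_BUCKETS)
        w = bucket(idf, IDF_BUCKETS)
        phi[d * width + w] += 1.0   # count matches in each (distance, IDF) cell
    return phi

def entity_score(snippet_feature_vectors, weights, temperature=1.0):
    """Aggregate per-snippet linear scores (w . phi) into one entity score.

    Uses a smooth log-sum-exp ("soft-max") aggregator; plain sum and hard max
    arise as limiting cases of the temperature parameter.
    """
    scores = [float(weights @ phi) for phi in snippet_feature_vectors]
    return temperature * math.log(sum(math.exp(s / temperature) for s in scores))
```

Because sum and max aggregation are recovered as limiting cases of the temperature, a smooth family of this kind is one natural candidate when the ranker is trained directly on entity-level relevance rather than on individual snippets.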
