Learning to expand queries using entities

A substantial fraction of web search queries contain references to entities, such as persons, organizations, and locations. Recently, methods that exploit named entities have been shown to be more effective for query expansion than traditional pseudorelevance feedback methods. In this article, we introduce a supervised learning approach that exploits named entities for query expansion using Wikipedia as a repository of high‐quality feedback documents. In contrast with existing entity‐oriented pseudorelevance feedback approaches, we tackle query expansion as a learning‐to‐rank problem. As a result, we not only select effective expansion terms but also weight these terms according to their predicted effectiveness. To this end, we exploit the rich structure of Wikipedia articles to devise discriminative term features, including each candidate term's proximity to the original query terms, as well as its frequency across multiple article fields and in category and infobox descriptors. Experiments on three Text REtrieval Conference (TREC) web test collections attest to the effectiveness of our approach, with gains of up to 23.32% in terms of mean average precision, 19.49% in terms of precision at 10, and 7.86% in terms of normalized discounted cumulative gain compared with a state‐of‐the‐art approach for entity‐oriented query expansion.
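To make the idea of scoring and weighting candidate expansion terms concrete, the following is a minimal sketch and not the implementation evaluated in the article. It assumes an entity article is available as a dictionary of text fields (the field names `title`, `body`, `categories`, and `infobox` are hypothetical), computes a per-field term frequency and a simple proximity feature for each candidate term, and combines them with a hand-set linear weight vector that merely stands in for a trained learning‐to‐rank model. The function names, the stopword list, and the weights are all illustrative assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "using"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def term_features(term, query_terms, article):
    """Build a feature dict for one candidate term.

    `article` maps a (hypothetical) field name to its text.
    """
    feats = {}
    # Frequency of the candidate term in each article field.
    for field, text in article.items():
        feats["tf_" + field] = tokenize(text).count(term)
    # Proximity: 1 / (1 + smallest token distance to a query term in the body).
    body = tokenize(article.get("body", ""))
    term_pos = [i for i, t in enumerate(body) if t == term]
    query_pos = [i for i, t in enumerate(body) if t in query_terms]
    if term_pos and query_pos:
        dist = min(abs(i - j) for i in term_pos for j in query_pos)
        feats["proximity"] = 1.0 / (1 + dist)
    else:
        feats["proximity"] = 0.0
    return feats


def rank_expansion_terms(query, article, weights, k=3):
    """Return the top-k candidate terms with scores normalized to sum to 1.

    `weights` is a placeholder for the parameters of a learned ranker.
    """
    query_terms = set(tokenize(query))
    candidates = set()
    for text in article.values():
        candidates.update(tokenize(text))
    candidates -= query_terms  # never expand with the original query terms

    scored = []
    for term in candidates:
        feats = term_features(term, query_terms, article)
        score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
        scored.append((term, score))
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    top = scored[:k]
    total = sum(score for _, score in top) or 1.0
    return [(term, score / total) for term, score in top]
```

In an actual system, the weight vector would be fit on training queries whose candidate terms are labeled by how much they help retrieval, and the normalized scores would serve as the term weights in the expanded query.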
