Embedding-based Query Language Models

Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.

[1]  Kevyn Collins-Thompson,et al.  Reducing the risk of query expansion via robust constrained optimization , 2009, CIKM.

[2]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[3]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[4]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[5]  Jian-Yun Nie,et al.  Using query contexts in information retrieval , 2007, SIGIR.

[6]  Po Hu,et al.  Learning Continuous Word Embedding with Metadata for Question Retrieval in Community Question Answering , 2015, ACL.

[7]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[8]  W. Bruce Croft,et al.  Query expansion using local and global document analysis , 1996, SIGIR '96.

[9]  Marie-Francine Moens,et al.  Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings , 2015, SIGIR.

[10]  Ellen M. Voorhees,et al.  Query expansion using lexical-semantic relations , 1994, SIGIR '94.

[11]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[12]  Mandar Mitra,et al.  Word Embedding based Generalized Language Model for Information Retrieval , 2015, SIGIR.

[13]  Florent Perronnin,et al.  Aggregating Continuous Word Embeddings for Information Retrieval , 2013, CVSM@ACL.

[14]  M. de Rijke,et al.  Short Text Similarity with Word Embeddings , 2015, CIKM.

[15]  Azadeh Shakery,et al.  Axiomatic Analysis for Improving the Log-Logistic Feedback Model , 2016, SIGIR.

[16]  W. Bruce Croft,et al.  Relevance-Based Language Models , 2001, SIGIR '01.

[17]  Nick Craswell,et al.  Query Expansion with Locally-Trained Word Embeddings , 2016, ACL.

[18]  John D. Lafferty,et al.  Model-based feedback in the language modeling approach to information retrieval , 2001, CIKM '01.

[19]  ChengXiang Zhai,et al.  Estimation of statistical translation models based on mutual information for ad hoc information retrieval , 2010, SIGIR.

[20]  Jean-Pierre Chevallet,et al.  A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information , 2016, ECIR.

[21]  W. Bruce Croft,et al.  A Language Modeling Approach to Information Retrieval , 1998, SIGIR Forum.

[22]  James P. Callan,et al.  Learning to Reweight Terms with Distributed Representations , 2015, SIGIR.

[23]  Yoshua Bengio,et al.  Learning Concept Embeddings for Query Expansion by Quantum Entropy Minimization , 2014, AAAI.

[24]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[25]  Claudio Carpineto,et al.  A Survey of Automatic Query Expansion in Information Retrieval , 2012, CSUR.

[26]  Dean P. Foster,et al.  Multi-View Learning of Word Embeddings via CCA , 2011, NIPS.

[27]  Fernando Diaz,et al.  UMass at TREC 2004: Novelty and HARD , 2004, TREC.

[28]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[29]  Guido Zuccon,et al.  Integrating and Evaluating Neural Word Embeddings in Information Retrieval , 2015, ADCS.

[30]  John D. Lafferty,et al.  Document Language Models, Query Models, and Risk Minimization for Information Retrieval , 2001, SIGIR Forum.

[31]  ChengXiang Zhai,et al.  A comparative study of methods for estimating query language models with pseudo feedback , 2009, CIKM.

[32]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[33]  W. Bruce Croft,et al.  Quary Expansion Using Local and Global Document Analysis , 1996, SIGIR Forum.