Low-dimensional Query Projection based on Divergence Minimization Feedback Model for Ad-hoc Retrieval

Low-dimensional word vectors are widely used across natural language processing. In this paper we focus on estimating query vectors in ad-hoc retrieval, where only limited information is available in the original query. Pseudo-relevance feedback (PRF) is a well-known technique for updating query language models by expanding queries with relevant terms. We formulate query updating in a low-dimensional space as first rotating the query vector and then scaling it; these consecutive steps are captured by a single query-specific projection matrix that encodes both rotation and scaling. Based on this query projection algorithm, we propose a new, though not necessarily the most effective, technique for PRF in the language modeling framework. For each query we learn an embedded coefficient matrix whose aim is to improve the vector representation of the query by transforming it into a more reliable space, and then update the query language model. The proposed embedded coefficient divergence minimization model (ECDMM) takes the top-ranked documents retrieved for the query, derives positive and negative sample sets from them, and uses these samples to learn the coefficient matrix; the learned matrix projects the query vector, and the query language model is updated with a softmax function. Experimental results on several TREC and CLEF data sets in several languages demonstrate the effectiveness of ECDMM: the new query formulation performs on par with state-of-the-art PRF techniques overall, and significantly outperforms them on one TREC collection in terms of MAP, P@5, and P@10.
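
The two-step update the abstract describes can be made concrete with a short sketch. The numpy code below is a minimal illustration under stated assumptions, not the paper's exact ECDMM formulation: the logistic, negative-sampling-style objective, the hyperparameters, and the helper names learn_projection and update_query_lm are all assumptions standing in for the paper's divergence-minimization objective. It keeps the pieces the abstract does specify: a query-specific coefficient matrix W learned from positive and negative feedback samples, a projected query vector Wq, and a softmax over embedding similarities that yields the updated query language model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learn_projection(q, positives, negatives, lr=0.1, epochs=200, reg=0.01):
    """Learn a query-specific coefficient matrix W (rotation + scaling).

    q:         (d,)  original query vector
    positives: (p,d) vectors of terms from top-ranked feedback documents
    negatives: (n,d) vectors treated as non-relevant samples
    Objective is an assumption: logistic (negative-sampling style) loss,
    regularized toward the identity so the original query is the default.
    """
    d = q.shape[0]
    W = np.eye(d)  # identity start: no rotation, no scaling
    for _ in range(epochs):
        qp = W @ q
        # gradient pulls the projected query toward positive samples
        # and pushes it away from negative samples
        g = (-(sigmoid(-positives @ qp)[:, None] * positives).mean(axis=0)
             + (sigmoid(negatives @ qp)[:, None] * negatives).mean(axis=0))
        W -= lr * (np.outer(g, q) + reg * (W - np.eye(d)))
    return W

def update_query_lm(q_proj, vocab_vecs, vocab_terms, orig_lm, alpha=0.5):
    """Turn similarities to the projected query into a feedback language
    model via a softmax, then interpolate with the original query LM."""
    scores = vocab_vecs @ q_proj
    scores -= scores.max()              # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return {t: alpha * orig_lm.get(t, 0.0) + (1 - alpha) * p
            for t, p in zip(vocab_terms, probs)}

# Toy usage with random vectors; in practice the samples would come from
# term embeddings of top-ranked (positive) and low-ranked (negative) docs.
rng = np.random.default_rng(0)
d, V = 50, 1000
q = rng.normal(size=d)
pos = rng.normal(size=(20, d)) + 0.5 * q    # correlated with the query
neg = rng.normal(size=(20, d))
W = learn_projection(q, pos, neg)
vocab_vecs = rng.normal(size=(V, d))
vocab_terms = [f"term{i}" for i in range(V)]
uniform_lm = {t: 1.0 / V for t in vocab_terms}
updated_lm = update_query_lm(W @ q, vocab_vecs, vocab_terms, uniform_lm)
```

Starting W at the identity and regularizing toward it means that with weak or noisy feedback the projection degenerates gracefully to the original query vector, mirroring how PRF interpolation falls back on the original query language model.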
