Learning bag-of-embedded-words representations for textual information retrieval

Abstract: Word embedding models can accurately model the semantic content of words. Extracting a set of word embedding vectors from a text document is similar to the feature extraction step of the Bag-of-Features (BoF) model, which is commonly used in computer vision tasks. This observation gives rise to the proposed Bag-of-Embedded-Words (BoEW) model, which efficiently represents text documents and overcomes the limitations of previously predominant techniques, such as the textual Bag-of-Words model. The proposed method extends the regular BoF model by a) incorporating a weighting mask that alters the importance of each learned codeword and b) optimizing the model end-to-end (from the word embeddings to the weighting mask). Furthermore, the BoEW model provides a fast way to fine-tune the learned representation towards the information need of the user using relevance feedback techniques. Finally, a novel spherical entropy objective function is proposed to optimize the learned representation for retrieval using the cosine similarity metric.
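The pipeline outlined in the abstract (word embedding vectors quantized against learned codewords, then reweighted by a mask and compared with cosine similarity) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the softmax-over-negative-distance soft assignment, the `sharpness` parameter, and the function names `boew_histogram` and `cosine_similarity` are hypothetical. The end-to-end training and the spherical entropy objective are not shown.

```python
import numpy as np


def boew_histogram(word_vectors, codebook, mask, sharpness=1.0):
    """Illustrative BoF-style document representation (not the paper's exact method).

    word_vectors: (n_words, dim) embedding vectors extracted from one document
    codebook:     (n_codewords, dim) codewords (assumed to be learned elsewhere)
    mask:         (n_codewords,) weighting mask scaling each codeword's importance
    sharpness:    assumed scaling factor controlling how soft the assignment is
    """
    # Squared Euclidean distance between every word vector and every codeword.
    diffs = word_vectors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)

    # Assumed soft assignment: softmax over negative distances, per word.
    logits = -sharpness * dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    memberships = np.exp(logits)
    memberships /= memberships.sum(axis=1, keepdims=True)

    # Average the memberships over all words, then apply the weighting mask.
    histogram = memberships.mean(axis=0)
    return histogram * mask


def cosine_similarity(a, b):
    """Cosine similarity between two document representations."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

In this sketch a query and a document would each be passed through `boew_histogram` with the same codebook and mask, and documents would then be ranked by `cosine_similarity` to the query. Setting a mask entry to zero removes the corresponding codeword from the comparison entirely.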
