Textual Similarity with a Bag-of-Embedded-Words Model

While words in documents are generally treated as discrete entities, they can be embedded in a Euclidean space which reflects an a priori notion of similarity between them. In such a case, a text document can be viewed as a bag-of-embedded-words (BoEW): a set of real-valued vectors. We propose a novel document representation based on such continuous word embeddings. It consists in non-linearly mapping the word-embeddings in a higher-dimensional space and in aggregating them into a document-level representation. We report retrieval experiments in the case where the word-embeddings are computed from standard topic models showing significant improvements with respect to the original topic models.

[1]  Thomas L. Griffiths,et al.  Integrating Topics and Syntax , 2004, NIPS.

[2]  Quan Wang,et al.  Regularized latent semantic indexing , 2011, SIGIR.

[3]  Eric P. Xing,et al.  Sparse Additive Generative Models of Text , 2011, ICML.

[4]  Christopher Potts,et al.  Learning Word Vectors for Sentiment Analysis , 2011, ACL.

[5]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[7]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[8]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[9]  Ruslan Salakhutdinov,et al.  Probabilistic Matrix Factorization , 2007, NIPS.

[10]  Jean-Cédric Chappelier,et al.  PLSI: The True Fisher Kernel and beyond , 2009, ECML/PKDD.

[11]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[12]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[13]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[14]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[15]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..