P-SIF: Document Embeddings Using Partition Averaging

Simple weighted averaging of word vectors often yields sentence representations that outperform sophisticated seq2seq neural models on many tasks. While it is desirable to use the same method for documents as well, its effectiveness is unfortunately lost on long documents spanning multiple sentences. A key reason is that a longer document is likely to contain words from many different topics, so collapsing it into a single vector that ignores this topical structure is unlikely to yield an effective representation. The problem is less acute for single sentences and other short text fragments, which most likely cover a single topic. To alleviate this problem, we present P-SIF, a partitioned word-averaging model for representing long documents. P-SIF retains the simplicity of weighted word averaging while taking a document's topical structure into account. In particular, P-SIF learns topic-specific vectors from a document and concatenates them to represent the overall document. We provide theoretical justification for the correctness of P-SIF. Through a comprehensive set of experiments, we demonstrate P-SIF's effectiveness over simple weighted averaging and many other baselines.
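To make the idea concrete, here is a minimal sketch of partition averaging, not the paper's exact pipeline: every word is softly assigned to topics, each topic partition accumulates a SIF-weighted average of its word vectors, the partitions are concatenated, and a SIF-style common component is removed. The soft assignments below come from a Gaussian mixture over word vectors, a stand-in assumption rather than the paper's dictionary-learning step, and names such as `psif_embed`, `num_topics`, and `sif_a` are illustrative placeholders.

```python
# Minimal sketch of partition averaging in the spirit of P-SIF.
# Assumptions: soft topic assignments come from a Gaussian mixture over
# word vectors (a stand-in for sparse dictionary learning), word weights
# follow the SIF scheme a / (a + p(w)), and `word_freq` covers every
# word in `word_vecs`.

import numpy as np
from sklearn.mixture import GaussianMixture


def psif_embed(docs, word_vecs, word_freq, num_topics=5, sif_a=1e-3):
    """Embed each document (a list of tokens) as a concatenation of
    topic-specific, SIF-weighted word-vector averages."""
    vocab = list(word_vecs)
    X = np.stack([word_vecs[w] for w in vocab])

    # Softly assign every vocabulary word to `num_topics` topics.
    gmm = GaussianMixture(n_components=num_topics, random_state=0).fit(X)
    topic_probs = dict(zip(vocab, gmm.predict_proba(X)))

    # SIF weight a / (a + p(w)): frequent words are down-weighted.
    total = sum(word_freq.values())
    weight = {w: sif_a / (sif_a + word_freq[w] / total) for w in vocab}

    dim = X.shape[1]
    embeddings = []
    for doc in docs:
        parts = np.zeros((num_topics, dim))
        for w in doc:
            if w in word_vecs:
                # Each word contributes to every topic partition in
                # proportion to its topic membership.
                parts += np.outer(topic_probs[w], weight[w] * word_vecs[w])
        # Concatenate the per-topic averages into one long vector.
        embeddings.append(parts.reshape(-1) / max(len(doc), 1))
    E = np.stack(embeddings)

    # SIF-style common-component removal: subtract each embedding's
    # projection onto the top singular vector of the embedding matrix.
    u = np.linalg.svd(E, full_matrices=False)[2][0]
    return E - np.outer(E @ u, u)
```

Concatenating the per-topic averages, rather than summing them into a single vector, is what preserves the topical structure: words from different topics land in different coordinates of the final embedding.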
