Low-Rank Approximations of Second-Order Document Representations

Document embeddings, created with methods ranging from simple heuristics to statistical and deep models, are widely applicable. Bag-of-vectors models for documents include the mean and quadratic approaches (Torki, 2018). We present evidence that quadratic statistics alone, without the mean information, can offer superior accuracy, fast document comparison, and compact document representations. In matching news articles to their comment threads, low-rank representations only 3-4 times the size of the mean vector give the most accurate matching, and in standard sentence comparison tasks, results are state of the art despite faster computation. We discuss similarity measures and contrast the Frobenius product implicit in the proposed method with the Wasserstein (Bures) metric from transportation theory. We also briefly demonstrate matching of unordered word lists to documents, as a way to measure their topicality or sentiment.
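
To make the construction concrete, below is a minimal NumPy sketch of a low-rank second-order representation and its Frobenius similarity. This is an illustration under our own assumptions, not the authors' released code: the function names, the default rank, and the use of an exact thin SVD (rather than, say, the randomized decomposition of Halko et al. [2]) are illustrative choices.

    import numpy as np

    def doc_factor(word_vectors, rank=20):
        # word_vectors: (n_words, dim) embeddings of one document's words.
        # The quadratic statistic is C = X^T X / n; a thin SVD of X yields
        # its top eigenvectors without forming the dim x dim matrix C.
        X = np.asarray(word_vectors, dtype=float)
        n = X.shape[0]
        _, s, vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
        r = min(rank, len(s))
        # L = V_r diag(s_r) gives L @ L.T ~= C (best rank-r approximation).
        return vt[:r].T * s[:r]

    def frobenius_similarity(L_a, L_b):
        # Frobenius product <C_a, C_b> = tr(C_a C_b), computed from the
        # factors: tr(L_a L_a^T L_b L_b^T) = ||L_a^T @ L_b||_F^2,
        # which costs only O(dim * r^2) and never materializes C.
        return float(np.sum((L_a.T @ L_b) ** 2))

The similarity is symmetric in its two inputs, so an unordered word list (e.g., a topic or sentiment lexicon) can be scored against a document by passing its word vectors through the same doc_factor. Note also that the Frobenius product needs only the small r x r cross-product of the factors, whereas the Bures-Wasserstein distance of Bhatia et al. [23], d(A, B)^2 = tr A + tr B - 2 tr((A^{1/2} B A^{1/2})^{1/2}), requires matrix square roots; this helps explain the speed advantage claimed above.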

[1] Boris Muzellec and Marco Cuturi. Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions, 2018, NeurIPS.

[2] Nathan Halko et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, 2009, SIAM Review.

[3] Emma Strubell et al. Energy and Policy Considerations for Deep Learning in NLP, 2019, ACL.

[4] Quoc V. Le and Tomas Mikolov. Distributed Representations of Sentences and Documents, 2014, ICML.

[5] Marwan Torki. A Document Descriptor using Covariance of Word Vectors, 2018, ACL.

[6] Alec Radford et al. Learning to Generate Reviews and Discovering Sentiment, 2017, arXiv.

[7] John Wieting et al. Towards Universal Paraphrastic Sentence Embeddings, 2015, ICLR.

[8] Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.

[9] Tomas Mikolov et al. Advances in Pre-Training Distributed Word Representations, 2017, LREC.

[10] Ryan Kiros et al. Skip-Thought Vectors, 2015, NIPS.

[11] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[12] Sanjeev Arora et al. A Simple but Tough-to-Beat Baseline for Sentence Embeddings, 2017, ICLR.

[13] Jeffrey Pennington et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[14] Matthew E. Peters et al. Deep Contextualized Word Representations, 2018, NAACL.

[15] Jason Lee et al. Fully Character-Level Neural Machine Translation without Explicit Segmentation, 2016, TACL.

[16] Tomas Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[17] Alexis Conneau and Douwe Kiela. SentEval: An Evaluation Toolkit for Universal Sentence Representations, 2018, LREC.

[18] John Wieting et al. From Paraphrase Database to Compositional Paraphrase Model and Back, 2015, TACL.

[19] Giannis Nikolentzos et al. Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization, 2017, EACL.

[20] Vivek Gupta et al. Unsupervised Document Representation using Partition Word-Vectors Averaging, 2018.

[21] Minmin Chen. Efficient Vector Representation for Documents through Corruption, 2017, ICLR.

[22] Martin Arjovsky et al. Wasserstein GAN, 2017, arXiv.

[23] R. Bhatia et al. On the Bures–Wasserstein distance between positive definite matrices, 2017, Expositiones Mathematicae.

[24] J. Graham et al. Moral Foundations Theory: The Pragmatic Validity of Moral Pluralism, 2012.

[25] Christian S. Perone et al. Evaluation of sentence embeddings in downstream and linguistic probing tasks, 2018, arXiv.

[26] Justin Garten et al. Morality Between the Lines: Detecting Moral Sentiment in Text, 2016.

[27] Maximilian Nickel and Douwe Kiela. Poincaré Embeddings for Learning Hierarchical Representations, 2017, NIPS.