Estimating Mutual Information Between Dense Word Embeddings

Word embedding-based similarity measures are currently among the top-performing methods on unsupervised semantic textual similarity (STS) tasks. Recent work has increasingly adopted a statistical view of these embeddings, and some of the top approaches are essentially correlations of various kinds (a family that includes the well-known cosine similarity). Mutual information (MI) is another strong candidate for a similarity measure: it can capture arbitrary dependencies between variables and has a simple, intuitive expression. Its use with dense word embeddings has so far been avoided, however, because MI is difficult to estimate for continuous data. In this work we survey the extensive literature on estimating MI in such cases and single out the most promising methods, yielding a simple and elegant similarity measure for word embeddings. We show that mutual information is a viable alternative to correlations: it gives a strong signal that correlates well with human judgements of similarity and rivals existing state-of-the-art unsupervised methods.
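To make the idea of estimating MI for continuous data concrete, the following is a minimal sketch of one well-known k-nearest-neighbour estimator (the Kraskov-Stögbauer-Grassberger estimator, first variant), written in pure Python. It is an illustration only and is not claimed to be the estimator selected in this work. The sketch assumes that the D components of two embedding vectors are treated as D paired scalar observations; that framing, the function names `digamma_int` and `ksg_mutual_information`, the choice k=3, and the synthetic vectors are all assumptions introduced for the example.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def ksg_mutual_information(x, y, k=3):
    """Estimate MI (in nats) between two scalar samples x and y.

    Uses the k-nearest-neighbour estimator (KSG, variant 1) with the
    max-norm in the joint (x, y) space. Brute force, O(n^2 log n).
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n <= k:
        raise ValueError("need more observations than neighbours k")

    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to every other point.
        dists = sorted(
            max(abs(x[i] - x[j]), abs(y[i] - y[j]))
            for j in range(n) if j != i
        )
        eps = dists[k - 1]  # distance to the k-th nearest neighbour
        # Count points strictly closer than eps in each marginal space.
        n_x = sum(1 for j in range(n) if j != i and abs(x[i] - x[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(y[i] - y[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)

    return digamma_int(k) + digamma_int(n) - total / n


if __name__ == "__main__":
    random.seed(0)
    dim = 300  # hypothetical embedding dimensionality
    # Synthetic stand-ins for two embedding vectors, not real embeddings.
    u = [random.gauss(0.0, 1.0) for _ in range(dim)]
    v_related = [a + 0.1 * random.gauss(0.0, 1.0) for a in u]
    v_unrelated = [random.gauss(0.0, 1.0) for _ in range(dim)]

    print("MI(u, related):  ", ksg_mutual_information(u, v_related))
    print("MI(u, unrelated):", ksg_mutual_information(u, v_unrelated))
```

Under these assumptions the estimate for the strongly dependent pair should come out clearly larger than for the independent pair, whose estimate should sit near zero (small negative values are possible because the estimator is not constrained to be non-negative). The brute-force neighbour search keeps the sketch dependency-free; a spatial index would be the usual choice for larger samples.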
