Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form

Word embedding algorithms produce highly reliable feature representations of words that are used by neural network models across a constantly growing multitude of NLP tasks. As such, it is imperative for NLP practitioners to understand how their word representations are produced, and why they are so impactful. The present work presents the Simple Embedder framework, generalizing existing state-of-the-art word embedding algorithms (including Word2vec (SGNS) and GloVe) under the umbrella of generalized low rank models. We derive that both of these algorithms attempt to produce embedding inner products that approximate pointwise mutual information (PMI) statistics in the corpus. Once cast as Simple Embedders, these models can be compared directly, revealing that these successful embedders all resemble a straightforward maximum likelihood estimate (MLE) of the PMI parametrized by the inner product between embeddings. This MLE induces our proposed novel word embedding model, Hilbert-MLE, as the canonical representative of the Simple Embedder framework. We empirically compare these algorithms with evaluations on 17 different datasets. Hilbert-MLE consistently achieves second-best performance on every extrinsic evaluation (news classification, sentiment analysis, POS-tagging, and supersense tagging), while the best-performing model varies depending on the task. Moreover, Hilbert-MLE consistently exhibits the least variance in results with respect to the random initialization of the weights in bidirectional LSTMs. Our empirical results demonstrate that Hilbert-MLE is a highly consistent word embedding algorithm that can be reliably integrated into existing NLP systems to obtain high-quality results.
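To make the abstract's central claim concrete, the following is a minimal illustrative sketch in our own notation (not an equation quoted from the paper): writing w_i for a word embedding and \tilde{w}_j for a context embedding, the pointwise mutual information of a word pair is

  \mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)},

and the Simple Embedder view described above is that an embedding algorithm seeks vectors whose inner products approximate this statistic,

  \langle w_i, \tilde{w}_j \rangle \approx \mathrm{PMI}(i, j),

i.e. a low-rank factorization of the PMI matrix. Under this reading, an objective of the generic generalized-low-rank-model form

  \min_{W, \tilde{W}} \; \sum_{i, j} f\big( \langle w_i, \tilde{w}_j \rangle,\ \mathrm{PMI}(i, j) \big)

would cover the models discussed, with f a model-specific discrepancy (for instance a weighted squared error, or the negative log-likelihood underlying an MLE such as Hilbert-MLE). This generic form is our own illustration of the framework under stated assumptions, not the paper's exact loss.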
