Embedding Word Similarity with Neural Machine Translation

Neural language models learn word representations, or embeddings, that capture rich linguistic and conceptual information. Here we investigate the embeddings learned by neural machine translation models, a recently developed class of neural language model. We show that embeddings from translation models outperform those learned by monolingual models at tasks that require knowledge of both conceptual similarity and lexical-syntactic role. We further show that these effects hold both when translating from English to French and when translating from English to German, and argue that the desirable properties of translation embeddings should emerge largely independently of the source and target languages. Finally, we apply a new method for training neural translation models with very large vocabularies, and show that this vocabulary-expansion algorithm results in minimal degradation of embedding quality. Our embedding spaces can be queried in an online demo and downloaded from our web page. Overall, our analyses indicate that translation-based embeddings should be used in applications that require concepts to be organised according to similarity and/or lexical function, while monolingual embeddings are better suited to modelling (nonspecific) inter-word relatedness.

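As a rough illustration of the kind of evaluation described above (a sketch, not the paper's own code), the Python snippet below scores an embedding space against a human-rated word-similarity benchmark in the style of SimLex-999 by correlating cosine similarities with human ratings. The file names and file formats here are assumptions for illustration only.

    # Minimal sketch: score an embedding space against a word-similarity
    # benchmark (assumed format: tab-separated word1, word2, human rating).
    import numpy as np
    from scipy.stats import spearmanr

    def load_embeddings(path):
        """Read a plain-text embedding file: one 'word v1 v2 ... vd' per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def cosine(u, v):
        """Cosine similarity between two vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def evaluate(vectors, benchmark_path):
        """Spearman correlation between model cosines and human ratings,
        computed over pairs whose words are both in the vocabulary."""
        model_scores, human_scores = [], []
        with open(benchmark_path, encoding="utf-8") as f:
            for line in f:
                w1, w2, rating = line.rstrip().split("\t")
                if w1 in vectors and w2 in vectors:
                    model_scores.append(cosine(vectors[w1], vectors[w2]))
                    human_scores.append(float(rating))
        return spearmanr(model_scores, human_scores).correlation

    vectors = load_embeddings("translation_embeddings.txt")  # hypothetical file
    print(evaluate(vectors, "similarity_pairs.tsv"))         # hypothetical file

A higher Spearman correlation on a genuine-similarity benchmark would support the paper's claim that translation embeddings organise words by conceptual similarity rather than mere relatedness.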