Embedding Learning Through Multilingual Concept Induction

We present a new method for estimating vector space representations of words: embedding learning by concept induction. We test this method on a highly parallel corpus and learn semantic representations of words in 1259 different languages in a single common space. An extensive experimental evaluation on crosslingual word similarity and sentiment analysis indicates that concept-based multilingual embedding learning performs better than previous approaches.
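The core idea — that words from many languages grouped under shared concepts end up in one common vector space — can be illustrated with a minimal sketch. The concept clusters below are hypothetical toy data (the induction step itself is not shown), and each word's vector is simply its membership profile over concept IDs, so translations that share concepts land near each other:

```python
from collections import defaultdict
import math

# Hypothetical induced concepts: each concept groups words across languages.
# Real concept induction from a parallel corpus is not shown here.
concepts = {
    "C1": ["eng:water", "deu:wasser", "fra:eau"],
    "C2": ["eng:fire", "deu:feuer", "fra:feu"],
    "C3": ["eng:water", "deu:wasser", "fra:mer"],  # overlapping cluster
}

def concept_vectors(concepts):
    """Represent each word as its membership vector over concept IDs."""
    ids = sorted(concepts)
    vecs = defaultdict(lambda: [0.0] * len(ids))
    for j, cid in enumerate(ids):
        for word in concepts[cid]:
            vecs[word][j] = 1.0
    return dict(vecs)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vecs = concept_vectors(concepts)
# Translations share concepts, so they are close in the common space:
print(cosine(vecs["eng:water"], vecs["deu:wasser"]))  # 1.0
print(cosine(vecs["eng:water"], vecs["eng:fire"]))    # 0.0
```

In practice one would train dense embeddings (e.g. skip-gram) over concept-annotated text rather than use raw membership vectors, but the sketch shows why a single space covering over a thousand languages is possible: the concept IDs, not the languages, define the dimensions.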
