Reconsidering Cross-lingual Word Embeddings

While cross-lingual word embeddings have been studied exte nsiv ly in recent years, the qualitative differences between the different algorithms remains vague. We observe that whether or not an algorithm uses a particular feature set (sentence IDs) ac counts for a significant performance gap among these algorithms. This feature set is also used by t raditional alignment algorithms, such as IBM Model-1, which demonstrate similar performance to state-of-the-art embedding algorithms on a variety of benchmarks. Overall, we observe tha t different algorithmic approaches for utilizing the sentence ID feature space result in simila r performance. This paper draws both empirical and theoretical parallels between the embedding and alignment literature, and suggests that adding additional sources of information, which go beyond the traditional signal of bilingual sentence-aligned corpora, is an appealing appro ach for substantially improving crosslingual word embeddings.

[1]  Christopher D. Manning,et al.  Bilingual Word Embeddings for Phrase-Based Machine Translation , 2013, EMNLP.

[2]  Miles Osborne,et al.  Statistical Machine Translation , 2010, Encyclopedia of Machine Learning and Data Mining.

[3]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[4]  Hugo Larochelle,et al.  An Autoencoder Approach to Learning Bilingual Word Representations , 2014, NIPS.

[5]  Mark Steedman,et al.  A massively parallel corpus: the Bible in 100 languages , 2014, Lang. Resour. Evaluation.

[6]  Phil Blunsom,et al.  Multilingual Distributed Representations without Word Alignment , 2013, ICLR 2014.

[7]  Anders Søgaard,et al.  Simple task-specific bilingual word embeddings , 2015, NAACL.

[8]  José B. Mariño,et al.  Guidelines for Word Alignment Evaluation and Manual Alignment , 2005, Lang. Resour. Evaluation.

[9]  Yoshua Bengio,et al.  BilBOWA: Fast Bilingual Distributed Representations without Word Alignments , 2014, ICML.

[10]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[11]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[12]  Christopher D. Manning,et al.  Bilingual Word Representations with Monolingual Quality in Mind , 2015, VS@HLT-NAACL.

[13]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[14]  Marie-Francine Moens,et al.  Bilingual Distributed Word Representations from Document-Aligned Comparable Data , 2015, J. Artif. Intell. Res..

[15]  Omer Levy,et al.  Neural Word Embedding as Implicit Matrix Factorization , 2014, NIPS.

[16]  Barbara Plank,et al.  Inverted indexing for cross-lingual NLP , 2015, ACL.

[17]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[18]  Min Xiao,et al.  Distributed Word Representation Learning for Cross-Lingual Dependency Parsing , 2014, CoNLL.

[19]  Ivan Titov,et al.  Inducing Crosslingual Distributed Representations of Words , 2012, COLING.

[20]  Quoc V. Le,et al.  Exploiting Similarities among Languages for Machine Translation , 2013, ArXiv.

[21]  João Graça,et al.  Building a Golden Collection of Parallel Multi-Language Word Alignment , 2008, LREC.

[22]  Gülsen Eryigit,et al.  Word Alignment for English-Turkish Language Pair , 2012, LREC.

[23]  Lars Ahrenberg,et al.  A Gold Standard for English-Swedish Word Alignment , 2011, NODALIDA.

[24]  Manaal Faruqui,et al.  Improving Vector Space Word Representations Using Multilingual Correlation , 2014, EACL.

[25]  Ted Pedersen,et al.  An Evaluation Exercise for Word Alignment , 2003, ParallelTexts@NAACL-HLT.

[26]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.