Paper Information - Reconsidering Cross-lingual Word Embeddings

Reconsidering Cross-lingual Word Embeddings

While cross-lingual word embeddings have been studied extensively in recent years, the qualitative differences between the different algorithms remain vague. We observe that whether or not an algorithm uses a particular feature set (sentence IDs) accounts for a significant performance gap among these algorithms. This feature set is also used by traditional alignment algorithms, such as IBM Model-1, which demonstrate similar performance to state-of-the-art embedding algorithms on a variety of benchmarks. Overall, we observe that different algorithmic approaches for utilizing the sentence ID feature space result in similar performance. This paper draws both empirical and theoretical parallels between the embedding and alignment literature, and suggests that adding additional sources of information, which go beyond the traditional signal of bilingual sentence-aligned corpora, is an appealing approach for substantially improving cross-lingual word embeddings.
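
To make the "sentence ID" feature set concrete, here is a minimal sketch and not the authors' implementation. It assumes a toy sentence-aligned corpus in which the i-th English sentence translates the i-th Spanish sentence, represents each word by the set of sentence IDs it occurs in, and compares words across languages with the Dice coefficient. The corpus, the function names (`sentence_id_features`, `dice`, `best_match`), and the choice of Dice as the similarity measure are all illustrative assumptions.

```python
from collections import defaultdict


def sentence_id_features(sentences):
    """Map each word to the set of sentence IDs in which it occurs."""
    features = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for word in sentence.split():
            features[word].add(sent_id)
    return features


def dice(ids_a, ids_b):
    """Dice coefficient between two sets of sentence IDs."""
    if not ids_a and not ids_b:
        return 0.0
    return 2 * len(ids_a & ids_b) / (len(ids_a) + len(ids_b))


# Hypothetical toy corpus: sentence i in one list is aligned with
# sentence i in the other, so both sides share the same sentence IDs.
english = ["the dog barks", "the cat sleeps", "a dog sleeps"]
spanish = ["el perro ladra", "el gato duerme", "un perro duerme"]

en = sentence_id_features(english)
es = sentence_id_features(spanish)


def best_match(word):
    """Return the Spanish word whose sentence-ID set is most similar."""
    return max(es, key=lambda target: dice(en[word], es[target]))


print(best_match("dog"))
print(best_match("cat"))
```

Because both languages index into the same sentence IDs, words that tend to appear in aligned sentences end up with overlapping feature sets. Under the assumptions above, that shared space is the signal the abstract refers to.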

Omer Levy | Anders Søgaard | Yoav Goldberg
