Leveraging Meta-Embeddings for Bilingual Lexicon Extraction from Specialized Comparable Corpora

Recent evaluations of bilingual lexicon extraction from specialized comparable corpora have reported mixed results for word embedding models. This can be partially explained by the lack of large specialized comparable corpora from which to build effective representations. In this context, we address two questions: (i) among state-of-the-art embedding models, whether trained on specialized corpora or pre-trained on large general-domain data sets, which is the most appropriate for bilingual terminology extraction? (ii) Is it worthwhile to combine multiple embeddings trained on different data sets? To that end, we propose the first systematic evaluation of different word embedding models for bilingual terminology extraction from specialized comparable corpora. We show that the character-based embedding model outperforms the other models in the quality of the extracted bilingual lexicons. Furthermore, we propose a new, efficient way to combine embedding models learned from specialized and general-domain data sets. Our approach yields higher performance than the best individual embedding model.
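
The abstract does not spell out how the embeddings are combined, so the following is only a minimal sketch of one common meta-embedding baseline: concatenating L2-normalized vectors from two models and ranking translation candidates by cosine similarity. It is not the authors' method. The function names (`concat_meta_embedding`, `rank_candidates`) are hypothetical, and the sketch assumes that both matrices list the same vocabulary in the same row order and that source and target vectors have already been mapped into a shared space.

```python
import numpy as np


def l2_normalize(matrix):
    """Scale each row to unit L2 norm (rows of zeros are left as zeros)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)


def concat_meta_embedding(specialized, general):
    """Concatenate row-normalized embeddings from two models.

    Assumption: row i of both matrices represents the same word.
    """
    if specialized.shape[0] != general.shape[0]:
        raise ValueError("both matrices must cover the same vocabulary")
    return np.hstack([l2_normalize(specialized), l2_normalize(general)])


def rank_candidates(source_vec, target_matrix):
    """Return target row indices sorted by decreasing cosine similarity.

    Assumption: source_vec already lives in the target embedding space.
    """
    targets = l2_normalize(target_matrix)
    source = source_vec / max(np.linalg.norm(source_vec), 1e-12)
    return np.argsort(-(targets @ source))
```

Normalizing each block before concatenation keeps one model from dominating the cosine score simply because its vectors have larger norms; other combination schemes (averaging, learned projections) are equally plausible readings of the abstract.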
