Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora

Comparable corpora are the main alternative to parallel corpora for extracting bilingual lexicons. Although comparable corpora are easier to build, specialized comparable corpora are often of modest size compared with general-domain corpora. Consequently, the word co-occurrence observations on which context-based methods rely are unreliable. In this article, we propose to improve the word co-occurrences of specialized comparable corpora, and thus their context representations, by using general-domain data. This idea, which has been used in machine translation for more than a decade, is not straightforward to apply to bilingual lexicon extraction from domain-specific comparable corpora. We go against the mainstream view in this task, where many studies support the idea that adding out-of-domain documents decreases the quality of lexicons. Our empirical evaluation shows the advantages of this approach, which yields a significant gain in the accuracy of the extracted lexicons.
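To make the idea of enriching sparse co-occurrence observations concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed-size context window and simply sums counts from a small specialized corpus with counts from a general-domain corpus; the function names, the window size, and the toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def merge_counts(specialized, general):
    """Pool two co-occurrence tables by summing their counts (an assumption)."""
    merged = defaultdict(Counter)
    for source in (specialized, general):
        for word, contexts in source.items():
            merged[word].update(contexts)
    return merged


# Toy, hypothetical data: a tiny specialized corpus and a general-domain one.
specialized = [["breast", "cancer", "screening"], ["cancer", "treatment"]]
general = [
    ["cancer", "research", "funding"],
    ["screening", "of", "cancer", "patients"],
]

spec_counts = cooccurrence_counts(specialized)
pooled = merge_counts(spec_counts, cooccurrence_counts(general))

# The pooled context vector of "cancer" has more observations than the
# specialized one alone.
print(sum(spec_counts["cancer"].values()), sum(pooled["cancer"].values()))
```

In a context-based pipeline, vectors such as `pooled["cancer"]` would typically be reweighted with an association measure and compared across languages through a seed dictionary; those steps are omitted here.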
