Bilingual Lexicon Extraction from Comparable Corpora Enhanced with Parallel Corpora

In this article, we present a simple and effective approach for extracting bilingual lexicon from comparable corpora enhanced with parallel corpora. We make use of structural characteristics of the documents comprising the comparable corpus to extract parallel sentences with a high degree of quality. We then use state-of-the-art techniques to build a specialized bilingual lexicon from these sentences and evaluate the contribution of this lexicon when added to the comparable corpus-based alignment technique. Finally, the value of this approach is demonstrated by the improvement of translation accuracy for medical words.

[1]  Holger Schwenk,et al.  On the Use of Comparable Corpora to Improve SMT performance , 2009, EACL.

[2]  Jian-Yun Nie,et al.  Parallel Web text mining for cross-language IR , 2000, RIAO.

[3]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[4]  Pascale Fung,et al.  A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora , 1998, AMTA.

[5]  Noah A. Smith,et al.  The Web as a Parallel Corpus , 2003, CL.

[6]  Emmanuel Morin,et al.  French-English Terminology Extraction from Comparable Corpora , 2005, IJCNLP.

[7]  J.-M. Lange,et al.  Modèles statistiques pour l'extraction de lexiques bilingues , 1995 .

[8]  Pierre Zweigenbaum,et al.  The Effect of a General Lexicon in Corpus-Based Identification of French-English Medical Word Translations , 2003, MIE.

[9]  Pascale Fung,et al.  Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E , 2004, EMNLP.

[10]  Hang Li,et al.  Base Noun Phrase Translation Using Web Data and the EM Algorithm , 2002, COLING.

[11]  Philippe Langlais,et al.  Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora , 2010, COLING.

[12]  Christopher C. Yang,et al.  Automatic construction of English/Chinese parallel corpora , 2003, J. Assoc. Inf. Sci. Technol..

[13]  Eric Gaussier,et al.  Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables , 2007 .

[14]  Fatiha Sadat,et al.  An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction , 2002, COLING.

[15]  MarcuDaniel,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005 .

[16]  Xiaoyi Ma,et al.  BITS: a method for bilingual text search over the Web , 1999, MTSUMMIT.

[17]  Pierre Zweigenbaum,et al.  Looking for Candidate Translational Equivalents in Specialized, Comparable Corpora , 2002, COLING.

[18]  Michael Carl,et al.  An Intelligent Terminology Database as a Pre-processor for Statistical Machine Translation , 2002, COLING 2002.

[19]  Éric Gaussier,et al.  Towards Automatic Extraction of Monolingual and Bilingual Terminology , 1994, COLING.

[20]  Jörg Tiedemann Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing , 2003 .

[21]  Kyo Kageura,et al.  Bilingual Terminology Mining - Using Brain, not brawn comparable corpora , 2007, ACL.

[22]  Reinhard Rapp,et al.  Automatic Identification of Word Translations from Unrelated English and German Corpora , 1999, ACL.