Improving the precision of automatically constructed human-oriented translation dictionaries

In this paper we address the problem of automatically acquiring a human-oriented translation dictionary from a large-scale parallel corpus. The initial translation equivalents can be extracted with the techniques and tools developed for phrase-table construction in statistical machine translation. The acquired translation equivalents usually provide good lexicon coverage, but they also contain a large amount of noise. We propose a supervised learning algorithm for detecting noisy translations that takes into account context and syntax features, averaged over the sentences in which a given phrase pair occurs. Across nine European language pairs, the number of serious translation errors is reduced by 43.2% compared to a baseline that uses only phrase-level statistics.
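The feature-averaging step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`context_overlap`, `same_pos`), the example phrase pairs, and the function `average_features` are all hypothetical, chosen only to show how per-sentence feature values might be collapsed into one vector per phrase pair before being passed to a supervised classifier.

```python
from collections import defaultdict
from statistics import fmean


def average_features(occurrences):
    """Average per-sentence feature values for each phrase pair.

    `occurrences` is an iterable of ((source, target), features) tuples,
    one per sentence in which the phrase pair was observed. `features`
    maps a (hypothetical) feature name to a numeric value.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for pair, features in occurrences:
        for name, value in features.items():
            grouped[pair][name].append(value)
    # Each feature is averaged over the occurrences in which it was present.
    return {
        pair: {name: fmean(values) for name, values in feats.items()}
        for pair, feats in grouped.items()
    }


# Illustrative data only; the values are invented for this example.
occurrences = [
    (("house", "Haus"), {"context_overlap": 0.8, "same_pos": 1.0}),
    (("house", "Haus"), {"context_overlap": 0.6, "same_pos": 1.0}),
    (("house", "die"), {"context_overlap": 0.1, "same_pos": 0.0}),
]

averaged = average_features(occurrences)
for pair, feats in sorted(averaged.items()):
    print(pair, feats)
```

The resulting per-pair vectors, possibly combined with phrase-level statistics, could then serve as input to any standard supervised classifier that labels each candidate translation as noisy or clean.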
