Statistical machine translation from comparable corpora: application to the Arabic-French language pair

This research aims to exploit comparable corpora for Statistical Machine Translation (SMT). First, we propose a hybrid approach, combining statistical and linguistic information, for bilingual terminology extraction from Wikipedia documents. Then, we propose a hybrid approach based on length and dictionary models for sentence-level alignment of the United Nations (UN) corpus. To validate the proposed approaches, we conducted evaluations on Arabic-French SMT. Our evaluation showed a significant improvement in terms of BLEU scores when using these two approaches as well as a pre-processing technique applied to the source language (Arabic).

KEYWORDS: Statistical Machine Translation (SMT), comparable corpora, Wikipedia, Arabic-French.
