Mining a Comparable Text Corpus for a Vietnamese-French Statistical Machine Translation System

This paper presents our first attempt at constructing a Vietnamese-French statistical machine translation system. Because Vietnamese is an under-resourced language, we concentrate on building a large Vietnamese-French parallel corpus. We propose a document alignment method based on publication date, special words, and sentence alignment results. We then apply the resulting parallel corpus to build a Vietnamese-French statistical machine translation system and discuss the use of different units for Vietnamese (syllables, words, or their combinations).
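The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of how the first two cues (publication date and special words) might be combined to shortlist candidate document pairs. Everything in it is an assumption made for illustration: the `Doc` record, the `candidate_pairs` function, the definition of "special words" as numbers and upper-case acronyms, the Jaccard overlap score, and the `max_days` and `min_overlap` thresholds. The third cue, the sentence alignment result, is not implemented here; it would be applied afterwards to confirm or reject the shortlisted pairs.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    """Hypothetical record for one news document."""
    doc_id: str
    published: date
    text: str


# Assumption: "special words" are tokens that tend to survive translation
# unchanged, approximated here by numbers and upper-case acronyms.
SPECIAL = re.compile(r"\b(?:\d+(?:[.,]\d+)*|[A-Z]{2,})\b")


def special_words(text):
    """Return the set of special tokens found in a text."""
    return set(SPECIAL.findall(text))


def candidate_pairs(vi_docs, fr_docs, max_days=2, min_overlap=0.5):
    """Shortlist (vi_id, fr_id, score) pairs, best score first.

    A pair is kept when the publication dates differ by at most
    max_days and the Jaccard overlap of special words is at least
    min_overlap. Both thresholds are illustrative values.
    """
    pairs = []
    for vi in vi_docs:
        vi_special = special_words(vi.text)
        for fr in fr_docs:
            # Cue 1: publication dates must be close.
            if abs((vi.published - fr.published).days) > max_days:
                continue
            # Cue 2: shared special words.
            union = vi_special | special_words(fr.text)
            if not union:
                continue
            score = len(vi_special & special_words(fr.text)) / len(union)
            if score >= min_overlap:
                pairs.append((vi.doc_id, fr.doc_id, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Filtering on the date first keeps the pairwise comparison cheap, since only documents published within a few days of each other ever reach the word-overlap step.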
