Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora

This paper reports the fully automatic compilation of parallel corpora for Brazilian Portuguese. Scientific news texts available in Brazilian Portuguese, English and Spanish are automatically crawled from a multilingual Brazilian magazine. The texts are then automatically aligned at document- and sentence-level. The resulting corpora contain about 2,700 parallel documents totaling over 150,000 aligned sentences each. The quality of the corpora and their usefulness are tested in an experiment with machine translation.
