Alignment of the Polish-English Parallel Text for a Statistical Machine Translation

Text alignment is crucial to the accuracy of machine translation (MT) systems, of some natural language processing (NLP) tools, and of any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach, developed and evaluated in Polish-to-English experiments (Polish is a language with relatively free word order). The approach was developed on the TED talks corpus but can be applied to any text domain or language pair. It implements several heuristics for sentence recognition, some of which use synonyms and semantic analysis of text structure as additional information. Care was taken to minimize data loss. The solution is compared with other sentence alignment implementations, and an improvement in MT system score on text processed with the described tool is also shown.
