Extended Parallel Corpus for Amharic-English Machine Translation

This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. The corpus will be useful for machine translation of Amharic, an under-resourced language. It is larger than previously compiled corpora and is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models on the corpus. In the automatic evaluation, the neural machine translation models outperform the phrase-based statistical machine translation models.

2012 ACM Subject Classification: Computing methodologies → Machine translation
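The paper itself does not include code, but the automatic evaluation it mentions can be illustrated with a minimal sketch: computing corpus-level BLEU for the two system outputs against the same English references. The use of the sacrebleu library and all file names here are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the automatic evaluation step: corpus-level BLEU for two
# MT systems against shared English references. File names are hypothetical.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("test.en")          # reference translations, one per line
nmt_hyps = read_lines("nmt.out")      # neural MT system output
pbsmt_hyps = read_lines("pbsmt.out")  # phrase-based SMT system output

# sacrebleu expects a list of reference streams (here, a single reference set).
nmt_bleu = sacrebleu.corpus_bleu(nmt_hyps, [refs])
pbsmt_bleu = sacrebleu.corpus_bleu(pbsmt_hyps, [refs])

print(f"NMT BLEU:   {nmt_bleu.score:.2f}")
print(f"PBSMT BLEU: {pbsmt_bleu.score:.2f}")
```

Corpus-level BLEU is used rather than averaged sentence-level scores because BLEU's brevity penalty and n-gram statistics are defined over the whole test set; reporting it via a standard tool keeps the scores comparable across papers.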
