The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and is recognized in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper also describes the alignment methodology, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.
