ANVITA Machine Translation System for WAT 2021 MultiIndicMT Shared Task

This paper describes the ANVITA-1.0 MT system, built by the mcairt team for submission to the WAT 2021 MultiIndicMT shared task, in which the team participated in 20 translation directions: English→Indic and Indic→English, with the Indic set comprising 10 Indian languages. ANVITA-1.0 consists of two multilingual NMT models with a shared encoder-decoder, one for the English→Indic directions and the other for the Indic→English directions, together covering 10 language pairs and 20 translation directions. The base models are built on the Transformer architecture and trained on the MultiIndicMT WAT 2021 corpora; the system further employs back-translation and transliteration for selective data augmentation, and model ensembling for better generalization. Additionally, the MultiIndicMT WAT 2021 corpora were distilled using a series of filtering operations before being used for training. On the official test set, ANVITA-1.0 achieved the highest AM-FM score for English→Bengali, 2nd for English→Tamil, and 3rd for English→Hindi and Bengali→English. In general, the performance achieved by ANVITA on the Indic→English directions is relatively better than on the English→Indic directions across all 10 language pairs when evaluated with BLEU and RIBES, although the same trend is not observed consistently under AM-FM based evaluation. Compared to BLEU, both RIBES and AM-FM scoring placed ANVITA relatively higher among the task participants.
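The abstract mentions that the training corpora were distilled "using a series of filtering operations" but does not enumerate them. As a minimal illustrative sketch only, the snippet below shows two heuristics commonly used for parallel-corpus filtering (a length-ratio check and a target-script check); these are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative parallel-corpus filtering sketch. The length-ratio and
# script-membership checks are assumed, common heuristics -- the paper
# does not specify which filtering operations were actually applied.

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Reject pairs whose token-length ratio is implausibly skewed."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return False
    return max(s, t) / min(s, t) <= max_ratio

def mostly_in_script(text: str, lo: int, hi: int, threshold: float = 0.5) -> bool:
    """Check that most non-space characters fall inside a Unicode block
    (e.g. Devanagari U+0900-U+097F for Hindi)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    in_block = sum(1 for c in chars if lo <= ord(c) <= hi)
    return in_block / len(chars) >= threshold

def filter_corpus(pairs, lo, hi):
    """Keep (english, indic) pairs that pass both heuristic checks."""
    return [(en, indic) for en, indic in pairs
            if length_ratio_ok(en, indic) and mostly_in_script(indic, lo, hi)]

pairs = [
    ("How are you?", "आप कैसे हैं?"),                                 # plausible pair
    ("Hello", "यह एक बहुत लंबा असंगत वाक्य है जो मेल नहीं खाता"),        # skewed length ratio
    ("Good morning", "Good morning"),                                # target not in Devanagari
]
kept = filter_corpus(pairs, 0x0900, 0x097F)   # only the first pair survives
```

In a real pipeline such heuristics would typically be combined with deduplication and language-identification filters before subword segmentation and training.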
