Paper information - Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Major advances in the performance of machine translation models have been made possible in part by the availability of large-scale parallel corpora. For most of the world's languages, however, such corpora are rare. Emakhuwa, a language spoken in Mozambique, is, like most African languages, low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, a collection of texts from the Jehovah’s Witness website and a variety of other sources, including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens in Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.
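The counts reported in the abstract imply some simple derived figures, such as average sentence length per language. The sketch below computes them from the three reported numbers only; the function name `corpus_ratios` and the choice of ratios are illustrative assumptions, not statistics given by the authors.

```python
# Counts as reported in the abstract.
SENTENCE_PAIRS = 47_415
EMAKHUWA_TOKENS = 699_976
PORTUGUESE_TOKENS = 877_595


def corpus_ratios(pairs: int, emakhuwa_tokens: int, portuguese_tokens: int) -> dict:
    """Derive per-sentence averages and the token ratio (hypothetical helper)."""
    return {
        "emakhuwa_tokens_per_sentence": emakhuwa_tokens / pairs,
        "portuguese_tokens_per_sentence": portuguese_tokens / pairs,
        "portuguese_to_emakhuwa_token_ratio": portuguese_tokens / emakhuwa_tokens,
    }


stats = corpus_ratios(SENTENCE_PAIRS, EMAKHUWA_TOKENS, PORTUGUESE_TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

These are averages over the whole corpus and say nothing about how sentence length varies across the individual sources.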

Andrew Caines | Felermino D. M. A. Ali | Jaimito L. A. Malavi
