Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection

With the rise of large pretrained language models, low-resource languages are rarely modelled monolingually and instead fall victim to the “curse of multilinguality” in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1 GB of data from many unrelated African languages can outperform massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We hypothesize that pretraining on smaller amounts of data drawn from related languages can match the performance of models trained on larger amounts of unrelated data. We test this hypothesis on the Niger-Congo family and its Bantu and Volta-Niger sub-families, pretraining models solely on data from Niger-Congo languages and finetuning them on four downstream tasks: named entity recognition (NER), part-of-speech tagging, sentiment analysis, and text classification. We find that models trained on genetically related languages achieve equal downstream performance on low-resource languages despite using less pretraining data. We recommend selecting training data based on language relatedness when pretraining language models for low-resource languages.
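To make the selection criterion concrete, below is a minimal sketch of how relatedness-based data selection could be automated with the URIEL/lang2vec vectors of [17]: candidate languages are ranked by the cosine similarity of their phylogeny ("fam") membership vectors against a target language. The candidate pool, the ISO 639-3 codes, and the cosine-ranking heuristic are illustrative assumptions, not the paper's exact procedure.

# Sketch: rank candidate pretraining languages by phylogenetic
# similarity to a target language, using URIEL/lang2vec [17].
import numpy as np
import lang2vec.lang2vec as l2v

def rank_by_relatedness(target, candidates):
    # "fam" vectors are binary memberships over the language family
    # tree; genetically related languages share more ancestor nodes.
    vectors = l2v.get_features([target] + candidates, "fam")
    t = np.asarray(vectors[target], dtype=float)
    scores = {}
    for lang in candidates:
        c = np.asarray(vectors[lang], dtype=float)
        denom = np.linalg.norm(t) * np.linalg.norm(c)
        scores[lang] = float(t @ c / denom) if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical pool: Yoruba as target; Igbo, Swahili, Zulu, Kinyarwanda
# (all Niger-Congo) plus Hausa (Afro-Asiatic) as an unrelated control.
for lang, score in rank_by_relatedness("yor", ["ibo", "swh", "zul", "kin", "hau"]):
    print(f"{lang}\t{score:.3f}")

Under this heuristic, the pretraining corpus would then be assembled from the highest-ranked languages, e.g. Volta-Niger languages for a Volta-Niger target.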

[1] David Ifeoluwa Adelani et al. YOSM: A New Yoruba Sentiment Corpus for Movie Reviews, 2022, ArXiv.

[2] Vukosi Marivate et al. Umsuka English-isiZulu Parallel Corpus, 2021.

[3] Ankur Bapna et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets, 2021, TACL.

[4] David Ifeoluwa Adelani et al. The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation, 2021, MT Summit.

[5] Antonios Anastasopoulos et al. BembaSpeech: A Speech Recognition Corpus for the Bemba Language, 2021, LREC.

[6] A. Öktem et al. Gamayun - Language Technology for Humanitarian Response, 2020, IEEE Global Humanitarian Technology Conference (GHTC).

[7] Hong Qu et al. KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi, 2020, COLING.

[8] Colin Raffel et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2020, NAACL.

[9] Dietrich Klakow et al. Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages, 2020, EMNLP.

[10] Bonaventure F. P. Dossou et al. FFR v1.1: Fon-French Neural Machine Translation, 2020, WiNLP.

[11] Paul Rayson et al. Igbo-English Machine Translation: An Evaluation Benchmark, 2020, ArXiv.

[12] Myle Ott et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.

[13] P. A. Owolawi et al. Part of Speech Tagging for Setswana African Language, 2019, International Multidisciplinary Information Technology and Engineering Conference (IMITEC).

[14] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[15] Guillaume Lample et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[16] Leland McInnes et al. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, 2018, ArXiv.

[17] Patrick Littell et al. URIEL and lang2vec: Representing Languages as Typological, Geographical, and Phylogenetic Vectors, 2017, EACL.

[18] Nkosikhona Dlamini et al. Part-of-Speech Tagging and Chunking in Text-to-Speech Synthesis for South African Languages, 2016, INTERSPEECH.

[19] Roald Eiselen et al. Government Domain Named Entity Recognition for South African Languages, 2016, LREC.

[20] Ikechukwu E. Onyenwe et al. Part-of-Speech Tagset and Corpus Development for Igbo, an African Language, 2014, LAW@COLING.

[21] Jörg Tiedemann et al. Parallel Data, Tools and Interfaces in OPUS, 2012, LREC.

[22] Saadat M. Alhashmi et al. Sentiment Analysis amidst Ambiguities in YouTube Comments on Yoruba Language (Nollywood) Movies, 2012, WWW.

[23] Gilles-Maurice de Schryver et al. Data-Driven Part-of-Speech Tagging of Kiswahili, 2006, TSD.

[24] J. Greenberg et al. Studies in African Linguistic Classification, 1957.

[25] Jimmy J. Lin et al. Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages, 2021, MRL.

[26] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[27] John O. R. Aoga et al. Part-of-Speech Tagging of Yoruba Standard, Language of Niger-Congo Family, 2013.