MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets and a lack of understanding of which settings, languages, and recently proposed methods, such as cross-lingual transfer, will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest human-annotated NER dataset to date for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14% over 20 languages compared to using English.
