NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language, including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
