Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi

The success of several architectures in learning semantic representations from unannotated text, together with the availability of such text in online multilingual resources like Wikipedia, has enabled the massive, automatic creation of resources for multiple languages. These resources are usually evaluated only for high-resourced languages, where a smorgasbord of tasks and test sets is available. For low-resourced languages, evaluation is harder and usually skipped, in the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting carries over to the low-resourced setting. In this paper we focus on two African languages, Yorùbá and Twi, and compare the word embeddings obtained in this way with word embeddings obtained from curated corpora and language-dependent processing. We analyse the noise in the publicly available corpora, collect both high-quality and noisy data for the two languages, and quantify the improvements that depend not only on the amount of data but also on its quality. We also use different architectures that learn word representations from both surface forms and characters, to further exploit all the available information, which proved important for these languages. For evaluation, we manually translate the wordsim-353 word-pair dataset from English into Yorùbá and Twi. As output of this work, we provide corpora, embeddings, and test suites for both languages.
