A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models

We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection, and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain .is. Several other public data sources are also collected, for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar for Icelandic baselines, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low- to medium-resource languages, by comparison with models trained on a curated corpus. We further show that initializing models from existing multilingual models can lead to state-of-the-art results for some downstream tasks.
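
The corpus construction described above rests on two simple filters: restricting web-crawled records to the Icelandic top-level domain .is and verifying that the extracted text is actually Icelandic. The sketch below is a minimal illustration of that idea, not the authors' pipeline; the WET file path, the confidence threshold, and the use of fastText's lid.176.bin language-identification model are assumptions made for the example.

```python
# Minimal sketch: keep only Common Crawl WET records from .is hosts that
# a language-ID model judges to be Icelandic. Illustrative only.
from urllib.parse import urlparse

import fasttext  # language identification (assumed lid.176.bin is downloaded locally)
from warcio.archiveiterator import ArchiveIterator

LID_MODEL = fasttext.load_model("lid.176.bin")


def is_icelandic(text: str, threshold: float = 0.8) -> bool:
    """Keep a record only if fastText is reasonably confident it is Icelandic."""
    labels, probs = LID_MODEL.predict(text.replace("\n", " ")[:2000])
    return labels[0] == "__label__is" and probs[0] >= threshold


def iter_is_domain_records(wet_path: str):
    """Yield (url, text) pairs whose host ends with the .is TLD."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET files store extracted text as "conversion" records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            host = urlparse(url).hostname or ""
            if not host.endswith(".is"):
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            if is_icelandic(text):
                yield url, text
```

In practice such a pass would be followed by boilerplate removal and deduplication before the text is used for pretraining.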

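The "warm start" referred to in the title can be illustrated with a short sketch: instead of pretraining from randomly initialized weights, masked-language-model training is continued from an existing multilingual checkpoint on the Icelandic corpus. The choice of xlm-roberta-base, the placeholder file ic3.txt, and the training hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal warm-start sketch: continue MLM pretraining from a multilingual checkpoint.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")  # warm start, not random init

# "ic3.txt" stands in for the cleaned Icelandic Common Crawl text (one document per line).
dataset = load_dataset("text", data_files={"train": "ic3.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="icebert-warm-start",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```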