A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts

This paper examines the effect of including background knowledge in the form of a pre-trained character-level neural language model (LM), and of data bootstrapping, to overcome the problem of limited and unbalanced resources. As a test case, we explore the task of language identification in mixed-language, short, non-edited texts involving an under-resourced language, namely Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that, overall, the DNN model performs better on labelled data for the majority categories but struggles with the minority ones. While the effect of the untokenised, unlabelled data encoded as an LM differs across categories, bootstrapping improves the performance of all systems for all categories. These methods are language-independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.
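To make the bootstrapping step concrete, the sketch below shows one common self-training formulation: train on the labelled words, label the unlabelled pool, and fold high-confidence predictions back into the training set. This is a minimal illustration under assumed choices (scikit-learn, character n-gram features, a confidence threshold of 0.9, and the names bootstrap/rounds/threshold are ours), not the paper's exact configuration.

    # Minimal self-training (bootstrapping) sketch for word-level language
    # identification. The classifier, features, and threshold here are
    # illustrative assumptions, not the paper's exact setup.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def bootstrap(labelled_words, labels, unlabelled_words,
                  rounds=3, threshold=0.9):
        """Iteratively grow the training set with confident self-labels."""
        words, y = list(labelled_words), list(labels)
        pool = list(unlabelled_words)
        clf = None
        for _ in range(rounds):
            # Character n-grams stand in for character-level features.
            clf = make_pipeline(
                TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
                LogisticRegression(max_iter=1000),
            )
            clf.fit(words, y)
            if not pool:
                break
            probs = clf.predict_proba(pool)
            keep = probs.max(axis=1) >= threshold  # trust only confident labels
            if not keep.any():
                break
            preds = clf.classes_[probs.argmax(axis=1)]
            words += [w for w, k in zip(pool, keep) if k]
            y += list(preds[keep])
            pool = [w for w, k in zip(pool, keep) if not k]
        return clf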
