From English to Code-Switching: Transfer Learning with Strong Morphological Clues

Linguistic code-switching (CS) is still an understudied phenomenon in natural language processing. The NLP community has mostly focused on monolingual and multilingual scenarios, and little attention has been given to CS in particular. This is partly due to the lack of resources and annotated data, despite the increasing occurrence of CS on social media platforms. In this paper, we aim to adapt monolingual models to code-switched text in various tasks. Specifically, we transfer English knowledge from a pre-trained ELMo model to different code-switched language pairs (i.e., Nepali-English, Spanish-English, and Hindi-English) using the task of language identification. Our method, CS-ELMo, is an extension of ELMo with a simple yet effective position-aware attention mechanism inside its character convolutions. We show the effectiveness of this transfer learning step by outperforming multilingual BERT and homologous CS-unaware ELMo models and by establishing a new state of the art in CS tasks such as NER and POS tagging. Our technique can be extended to more English-paired code-switched languages, providing more resources to the CS community.
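
The abstract names a position-aware attention mechanism inside the character convolutions but does not spell it out. The following is a minimal illustrative sketch of one way such a mechanism could pool character n-gram features, not the paper's actual implementation. The function name `position_aware_attention`, the additive position embeddings `pos_emb`, the scoring vector `v`, and the `tanh` scoring form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_aware_attention(ngram_feats, pos_emb, v):
    """Pool character n-gram features of one word with attention.

    ngram_feats: list of T feature vectors (each of size d), e.g. the
                 outputs of a character convolution (assumed input).
    pos_emb:     list of at least T position vectors of size d
                 (hypothetical learned parameters).
    v:           scoring vector of size d (hypothetical parameter).
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for t, x in enumerate(ngram_feats):
        # Inject position information before scoring each n-gram.
        h = [math.tanh(xi + pi) for xi, pi in zip(x, pos_emb[t])]
        scores.append(sum(vi * hi for vi, hi in zip(v, h)))
    weights = softmax(scores)
    d = len(ngram_feats[0])
    # Weighted sum of the original (position-free) n-gram features.
    pooled = [sum(w * x[i] for w, x in zip(weights, ngram_feats))
              for i in range(d)]
    return pooled, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three n-grams, d = 2
    pos = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2]]     # toy position vectors
    pooled, weights = position_aware_attention(feats, pos, [0.5, -0.5])
    print(pooled, weights)
```

In this sketch the attention weights replace a fixed pooling step over n-gram positions, so the model can emphasize, for example, word-initial or word-final character sequences. Whether this matches the paper's design in detail is not stated in the abstract.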
