GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present GLUECoS, an evaluation benchmark for code-switched languages that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering, and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that across most tasks and both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
