Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques

Code-Mixing (CM), the phenomenon of alternating between two or more languages, is prevalent in bi- and multilingual communities. Most NLP applications today still assume a single interaction language and are likely to break on a CM utterance in which languages are mixed at the morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research that supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages: Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English), which span two language families (Indo-Aryan and Dravidian). We share the details of our data collection process, the techniques used to avoid inducing lexical bias amongst the crowd workers, and other CM-specific linguistic properties of the dataset. Our final dataset, which is freely available for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We also discuss the techniques used by the participants in the first edition of this ongoing challenge.
