WebShodh: A Code Mixed Factoid Question Answering System for Web

Code-Mixing (CM) is a natural phenomenon observed in many multilingual societies and is becoming the preferred medium of expression and communication in online and social media fora. In spite of this, current Question Answering (QA) systems do not support CM and are designed to work with only a single interaction language. This assumption makes it inconvenient for multilingual users to interact naturally with a QA system, especially when they do not know the right word in the target language. In this paper, we present WebShodh, an end-to-end web-based factoid QA system for CM languages. We demonstrate our system with two CM language pairs: Hinglish (matrix language: Hindi, embedded language: English) and Tenglish (matrix language: Telugu, embedded language: English). The lack of language resources such as annotated corpora, POS taggers and parsers for CM languages poses a major challenge for automated processing and analysis. In view of this resource scarcity, we assume only the existence of bilingual dictionaries from the matrix languages to English and use them to lexically translate the question into English. We then use this loosely translated question for downstream analysis such as Answer Type (AType) prediction, answer retrieval and ranking. Evaluation of our system shows that it achieves a Mean Reciprocal Rank (MRR) of 0.37 for Hinglish and 0.32 for Tenglish. We have hosted the system online and plan to leverage it to collect more CM question and answer data for further improvement.
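To make the two technical ideas in the abstract concrete, the following is a minimal Python sketch, not the system's actual implementation. It illustrates (a) word-by-word lexical translation of a code-mixed question using a bilingual dictionary, and (b) how an MRR score such as those reported is computed from the rank of the first correct answer per question. The dictionary `TOY_HI_EN`, its entries, and the function names `lexical_translate` and `mean_reciprocal_rank` are assumptions introduced purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy dictionary from romanized Hindi to English.
# These entries are illustrative only, not the paper's resources.
TOY_HI_EN: Dict[str, str] = {
    "ki": "of",
    "rajdhani": "capital",
    "kya": "what",
    "hai": "is",
}


def lexical_translate(question: str, bilingual_dict: Dict[str, str]) -> str:
    """Translate a question word by word.

    Tokens missing from the dictionary (for example, embedded English
    words) are passed through unchanged, so the output is only a loose
    translation with the matrix-language word order preserved.
    """
    tokens = question.lower().rstrip("?").split()
    return " ".join(bilingual_dict.get(tok, tok) for tok in tokens)


def mean_reciprocal_rank(first_correct_ranks: List[Optional[int]]) -> float:
    """Compute MRR from 1-based ranks of the first correct answer.

    A rank of None means no correct answer was returned for that
    question; it contributes 0 to the sum.
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)


if __name__ == "__main__":
    print(lexical_translate("India ki capital kya hai?", TOY_HI_EN))
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

In this sketch the English word "capital" is not in the dictionary and simply passes through, which mirrors how embedded-language words need no translation. The ranks passed to `mean_reciprocal_rank` are made-up values used only to show the arithmetic.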
