The First Cross-Script Code-Mixed Question Answering Corpus

In this paper, we formally introduce the problem of cross-script code-mixed question answering (QA) and describe the corpus acquisition process and an evaluation strategy for it. Social media platforms today are flooded with millions of posts every day on a wide range of topics. For the first time, this paper uses such ever-growing user-generated content as an information source for the QA task in a low-resource language. The majority of these posts are multilingual, and many of them involve code mixing. The multilingual character of social media content shows both in the mixing of words from several languages and in the choice of writing script: for ease of use, multilingual users often pose questions in a non-native script. In this setting, code-mixed cross-script (i.e., non-native script) data give rise to a new problem and pose serious challenges to automatic QA. In the work presented here, Bengali is the native language and English is the non-native language. However, the dataset construction approach is generic and could be applied to any other language pair. Apart from introducing this novel problem, the paper describes the corpus development process and a suitable evaluation framework.
