PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

Code-mixing is the use of more than one language within a single sentence, and it is very common on social media platforms. The flexibility to use multiple languages in one message may help users communicate efficiently with their target audience, but it makes processing and understanding natural language considerably harder. This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding English translations. The translations were produced manually by annotators. We are releasing the parallel corpus to facilitate future research on code-mixed machine translation. The annotated corpus is available online.
