Language Identification and Translation of English and Gujarati code-mixed data

Code-mixing is a growing area of research in Natural Language Processing. Communication on social media involves code-mixed text, colloquial language, and spelling variations. Mixing different languages in a conversation leads to transliteration and romanization. At present, few open-source resources exist for code-mixed English-Gujarati data, even though the volume of such data grows rapidly every day as the number of social media users increases. In this paper, we present a linguistic resource that we created for English and Gujarati. We also present our approach to language identification and data normalization, along with the translation of transliterated text into its native-language form.
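
The abstract does not say how language identification, normalization, or conversion to native form is carried out, so the following is only a minimal sketch of what such a pipeline might look like, not the method used in the paper. It assumes a simple lexicon lookup; the word lists, the function names (`normalize`, `tag_token`, `process`), and the rule for collapsing repeated characters are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical toy lexicons (assumption): a real system would rely on large
# word lists or a trained classifier rather than a handful of entries.
ENGLISH_WORDS = {"how", "are", "you", "today", "good"}
GUJARATI_ROMAN_TO_NATIVE = {"kem": "કેમ", "cho": "છો", "tame": "તમે"}


def normalize(token):
    """Lowercase the token and collapse runs of three or more identical
    characters into one, e.g. 'chooo' -> 'cho' (illustrative rule only)."""
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())


def tag_token(token):
    """Return 'gu', 'en', or 'unk' for a single romanized token."""
    norm = normalize(token)
    if norm in GUJARATI_ROMAN_TO_NATIVE:
        return "gu"
    if norm in ENGLISH_WORDS:
        return "en"
    return "unk"


def process(sentence):
    """Return (original token, language tag, output form) per token.
    Romanized Gujarati tokens found in the lexicon are mapped to Gujarati
    script; all other tokens are returned in normalized form."""
    result = []
    for token in sentence.split():
        norm = normalize(token)
        lang = tag_token(token)
        native = GUJARATI_ROMAN_TO_NATIVE.get(norm, norm)
        result.append((token, lang, native))
    return result


if __name__ == "__main__":
    for row in process("Kem chooo today"):
        print(row)
```

A lookup table keeps the sketch short, but it cannot handle unseen spelling variants or words shared by both languages, which is where statistical or neural approaches would normally come in.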
