Sentiment analysis for mixed script Indic sentences

India is a multilingual and multi-script country, and developing natural language processing techniques for Indic languages is an active area of research. With the advent of social media, users increasingly mix languages in their posts: they are more comfortable in their regional language and tend to express their thoughts by combining words from several languages. In this paper, we develop a system for mining sentiment from code-mixed sentences that combine English with four Indian languages (Tamil, Telugu, Hindi and Bengali). Because of the complexity of the problem, the approach is divided into two stages, viz. language identification and sentiment mining. The results are compared against a baseline obtained from sentences machine-translated into English and are found to be around 8% better in terms of precision. The proposed approach is flexible and robust enough to handle additional languages for identification, as well as anomalous foreign or extraneous words.
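
The two-stage split described above can be illustrated with a minimal sketch. This is not the system evaluated in the paper: the lexicons, the language tags, and the function names (`identify_languages`, `mine_sentiment`, `classify`) are hypothetical, and simple dictionary lookup stands in for whatever identification and sentiment models the authors actually use. It only shows how word-level language tags from the first stage can select the sentiment resource used in the second, and how an unknown word can be passed through without breaking the pipeline.

```python
# Hypothetical toy lexicon: romanized word -> language tag (assumption, not from the paper).
LANG_LEXICON = {
    "movie": "en", "was": "en", "the": "en", "good": "en", "bad": "en",
    "bahut": "hi", "accha": "hi", "bura": "hi", "hai": "hi", "tha": "hi",
}

# Hypothetical per-language polarity lexicons (assumption, not from the paper).
SENTIMENT_LEXICON = {
    "en": {"good": 1, "bad": -1},
    "hi": {"accha": 1, "bura": -1},
}


def identify_languages(sentence):
    """Stage 1: tag each token with a language; unknown words are tagged 'other'."""
    tokens = sentence.lower().split()
    return [(tok, LANG_LEXICON.get(tok, "other")) for tok in tokens]


def mine_sentiment(tagged_tokens):
    """Stage 2: sum each token's polarity using the lexicon of its identified language."""
    score = 0
    for tok, lang in tagged_tokens:
        # Tokens tagged 'other' have no lexicon and contribute 0.
        score += SENTIMENT_LEXICON.get(lang, {}).get(tok, 0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def classify(sentence):
    """Run both stages on a code-mixed sentence."""
    return mine_sentiment(identify_languages(sentence))


if __name__ == "__main__":
    print(classify("movie bahut accha tha"))
    print(classify("the movie was bura xyz"))
```

In this sketch, adding a language amounts to adding entries to the two lexicons, which loosely mirrors the extensibility claim in the abstract; a real system would replace both lookups with trained models.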
