Automatic Normalization of Word Variations in Code-Mixed Social Media Text

Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data consist of anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words where the different spelling variation of words share similar context in a large noisy social media text. We capture different variations of words belonging to same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing of the code-mixed dataset based on our approach improves the performance in state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.

[1]  Xiang Zhang,et al.  Text Understanding from Scratch , 2015, ArXiv.

[2]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[3]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[4]  Pieter Muysken,et al.  Bilingual Speech: A Typology of Code-Mixing , 2000 .

[5]  Monojit Choudhury,et al.  Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System , 2014, CodeSwitch@EMNLP.

[6]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[7]  Dipankar Das,et al.  Part-of-speech Tagging of Code-Mixed Social Media Text , 2016, CodeSwitch@EMNLP.

[8]  J. Gumperz Discourse strategies: Introduction , 1982 .

[9]  Luisa Duran,et al.  Toward a Better Understanding of Code-Switching and Interlanguage in Bilinguality: Implications for Bilingual Instruction. , 1994 .

[10]  Manish Shrivastava,et al.  Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text , 2016, COLING.

[11]  M. Gysels French in urban Lubumbashi Swahili: Codeswitching, borrowing, or both? , 1992 .

[12]  Somnath Banerjee,et al.  Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval , 2015, FIRE Workshops.

[13]  Julia Hirschberg,et al.  Overview for the First Shared Task on Language Identification in Code-Switched Data , 2014, CodeSwitch@EMNLP.

[14]  Jatin Sharma,et al.  POS Tagging of English-Hindi Code-Mixed Social Media Content , 2014, EMNLP.

[15]  Rakesh Chandra Balabantaray,et al.  Text normalization of code mix and sentiment analysis , 2015, 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

[16]  Laurens van der Maaten,et al.  Learning a Parametric Embedding by Preserving Local Structure , 2009, AISTATS.

[17]  Joachim Wagner,et al.  Code Mixing: A Challenge for Language Identification in the Language of Social Media , 2014, CodeSwitch@EMNLP.

[18]  Lillian Lee,et al.  Opinion Mining and Sentiment Analysis , 2008, Found. Trends Inf. Retr..

[19]  Christopher D. Manning,et al.  Baselines and Bigrams: Simple, Good Sentiment and Topic Classification , 2012, ACL.

[20]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[21]  Vivek Kumar Rangarajan Sridhar Unsupervised Text Normalization Using Distributed Representations of Words and Phrases , 2015, VS@HLT-NAACL.