Machine Learning based Language Modelling of Code Switched Data

With the rapid increase in internet users all over the world, social media platforms have grown at a tremendous pace. Code-switched languages, in which the speaker alternates between two or more languages (for example Hinglish, in which Hindi words are written in the English (Roman) script), are a popular medium of communication on social media. Such text is characterized by a lack of grammatical structure and by variation in spelling. These linguistic properties, combined with a lack of data, cause ambiguity and make text classification on code-switched data difficult. In this paper, we propose a Language Modelling (LM) based approach to the classification of Hinglish text. We build a Universal Language Model Fine-tuning (ULMFiT) model with an AWD-LSTM architecture, trained on a Hindi-English code-switched (Hinglish) corpus collected from various blogging sites. The language model encodes important information about the code-switched data and can be quickly fine-tuned on a given Hinglish dataset to achieve good results. We evaluate the model on the code-switched aggression detection TRAC-1 dataset, the Hinglish Offensive Tweet (HOT) dataset, and a humour-classification dataset. On these datasets, the proposed method surpasses previously reported results.
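As a rough illustration of the fine-tuning step, the original ULMFiT method (Howard and Ruder, 2018) uses a slanted triangular learning rate: a short linear warm-up followed by a long linear decay. The sketch below computes that schedule in plain Python. The function name and the default values (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`) follow the ULMFiT paper and are assumptions here; the abstract does not say which schedule or hyperparameters were used for the Hinglish model.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t under a slanted triangular schedule.

    Hypothetical helper for illustration only. Defaults are taken from the
    ULMFiT paper, not from the Hinglish experiments summarized above.
    """
    # Step at which the rate stops increasing and starts decaying.
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        # Linear warm-up: p goes from 0 to 1 over the first `cut` steps.
        p = t / cut
    else:
        # Linear decay: p goes from 1 back to 0 over the remaining steps.
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    # Scale p so the rate ranges from lr_max / ratio up to lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio


if __name__ == "__main__":
    total = 1000
    for step in (0, 50, 100, 500, 1000):
        print(step, slanted_triangular_lr(step, total))
```

The short warm-up lets the model move quickly toward a suitable region of parameter space for the target task, and the long decay then refines the parameters gradually, which is the motivation given in the ULMFiT paper for small target datasets.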
