Hatred and trolling detection transliteration framework using hierarchical LSTM in code-mixed social media text

This paper describes the use of a self-learning hierarchical LSTM (HLSTM) technique for classifying hatred and trolling content in code-mixed social media data. HLSTM-based learning is a novel architecture inspired by neural learning models. The proposed HLSTM model is trained to identify hatred and trolling words in social media content, and it is equipped with a self-learning and prediction mechanism for annotating hatred words in the transliteration domain. The Hindi–English data are labelled as Hindi, English, or hatred for classification. Word-embedding and character-embedding features are used to represent each word in a sentence so that hatred words can be detected. The HLSTM-based method helps recognize the context of a hatred word by mining the user's intention in using that word in the sentence. Extensive experiments suggest that the HLSTM-based classification model achieves an accuracy of 97.49% when compared with baseline models such as BLSTM, CRF, LR, SVM, Random Forest, and Decision Tree, especially when the social media data contain hatred and trolling words.
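As a rough illustration of the word-level and character-level representation mentioned above, the following sketch encodes a romanized Hindi–English sentence into the integer ids that an embedding layer would typically consume. It is not the paper's implementation: the helper names (`build_vocab`, `encode_sentence`), the `max_chars` padding length, the example sentence, and the label strings are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

PAD, UNK = "<pad>", "<unk>"  # reserved symbols (assumed convention)


def build_vocab(items: List[str]) -> Dict[str, int]:
    """Map each distinct item to an integer id; ids 0 and 1 are reserved."""
    vocab = {PAD: 0, UNK: 1}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def encode_sentence(
    tokens: List[str],
    word_vocab: Dict[str, int],
    char_vocab: Dict[str, int],
    max_chars: int = 8,
) -> List[Tuple[int, List[int]]]:
    """Return, per token, a word id plus a fixed-length list of character ids."""
    encoded = []
    for tok in tokens:
        low = tok.lower()
        word_id = word_vocab.get(low, word_vocab[UNK])
        # Truncate long tokens, then right-pad short ones with the PAD id.
        char_ids = [char_vocab.get(c, char_vocab[UNK]) for c in low[:max_chars]]
        char_ids += [char_vocab[PAD]] * (max_chars - len(char_ids))
        encoded.append((word_id, char_ids))
    return encoded


# Hypothetical code-mixed sentence with illustrative per-word labels.
tokens = ["yeh", "movie", "wala", "idiot", "hai"]
labels = ["Hindi", "English", "Hindi", "Hatred", "Hindi"]

word_vocab = build_vocab([t.lower() for t in tokens])
char_vocab = build_vocab([c for t in tokens for c in t.lower()])
encoded = encode_sentence(tokens, word_vocab, char_vocab)

for tok, lab, (word_id, char_ids) in zip(tokens, labels, encoded):
    print(tok, lab, word_id, char_ids)
```

In a hierarchical setup of this kind, the character ids would usually feed a lower-level sequence model that produces one vector per word, which is then combined with the word embedding before a sentence-level LSTM assigns a label to each token. Keeping a character-level view is a common way to cope with the spelling variation of transliterated words.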
