Identification of hate speech and abusive language on Indonesian Twitter using Word2vec, part of speech and emoji features

Freedom of speech on social media in Indonesia makes the spread of hate speech and abusive language inevitable. Without proper handling, this can lead to social disharmony between individuals and communities. Identifying hate speech and abusive language in Indonesian-language tweets is challenging. Because semantic features such as word embeddings capture the meaning of words in a sentence, they can help identify tweets that contain hateful and abusive words. In this study, the word embedding (word2vec) feature and its combinations with part of speech and/or emoji features were used to identify hate speech and abusive language on Twitter in the Indonesian language. Combinations of unigram features with part of speech and/or emoji features were also evaluated in the experiments. The classification algorithms used were Support Vector Machine, Random Forest Decision Tree, and Logistic Regression. The combination of unigram, part of speech, and emoji features achieved the highest accuracy, 79.85%, with an F-measure of 87.51%.
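The abstract does not describe how the feature groups were merged, so the following is only a minimal sketch of one plausible way to combine unigram, part of speech, and emoji features into a single feature dictionary for a tweet. The function names, the emoji code point ranges, the example tokens, and the tag labels are all assumptions made for illustration; the tweet is assumed to be already tokenized and tagged.

```python
from collections import Counter

# Assumed emoji code point ranges for illustration; not an exhaustive list.
EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))


def is_emoji(ch):
    """Return True if the character falls in one of the assumed emoji ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def combine_features(tagged_tokens):
    """Build one feature dictionary from (token, pos_tag) pairs.

    Emoji tokens are counted under an 'emoji=' prefix. All other tokens
    contribute a lowercased unigram count ('uni=') and a tag count ('pos=').
    """
    feats = Counter()
    for token, tag in tagged_tokens:
        if token and all(is_emoji(ch) for ch in token):
            feats["emoji=" + token] += 1
        else:
            feats["uni=" + token.lower()] += 1
            feats["pos=" + tag] += 1
    return dict(feats)


# Hypothetical pre-tagged tweet; the tag labels are illustrative only.
example = [("Kamu", "PRP"), ("bodoh", "JJ"), ("banget", "RB"),
           ("😡", "SYM"), ("😡", "SYM")]
features = combine_features(example)
```

A dictionary of this shape can be vectorized and passed to any of the classifiers named in the abstract. Prefixing the keys keeps the three feature groups separable, so a group can be dropped to reproduce a different feature combination.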
