Detecting Hate, Offensive, and Regular Speech in Short Comments

The freedom of expression afforded by the Internet also benefits malicious groups that spread hateful content, recruit new members, and threaten users. In this context, we propose a new approach to hate speech identification that represents documents with Information Theory quantifiers (entropy and divergence). What distinguishes our approach is that it captures the weighted information of words rather than just their frequency in documents. The results show that our approach outperforms techniques that combine data representations such as TF-IDF and unigrams with text classifiers, achieving F1-scores of 86%, 84%, and 96% for the hate, offensive, and regular speech classes, respectively. Compared to the baselines, our proposal is a win-win solution that improves both efficacy (F1-score) and efficiency (by reducing the dimension of the feature vector). The proposed solution is up to 2.27 times faster than the baseline.
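To make the idea of representing a document by entropy and divergence concrete, the following is a minimal sketch and not the authors' implementation. It assumes Shannon entropy and the Jensen-Shannon divergence (the abstract does not say which divergence is used), plain relative word frequencies instead of the weighted word information the approach relies on, and a tiny made-up corpus. The names `distribution`, `js_divergence`, and `features` are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    if total == 0:
        # No known words: fall back to a uniform distribution (assumption).
        return [1.0 / len(vocab)] * len(vocab)
    return [counts[w] / total for w in vocab]


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (an assumed choice of divergence)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # b[i] > 0 whenever a[i] > 0, because b is the mixture m.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def features(doc_tokens, class_tokens, vocab):
    """Entropy of the document plus its divergence from each class."""
    p = distribution(doc_tokens, vocab)
    feats = [entropy(p)]
    for label in sorted(class_tokens):
        q = distribution(class_tokens[label], vocab)
        feats.append(js_divergence(p, q))
    return feats


# Toy, made-up word lists standing in for the three classes.
class_tokens = {
    "hate": ["group", "threat", "threat", "leave"],
    "offensive": ["stupid", "dumb", "stupid", "game"],
    "regular": ["game", "nice", "weather", "nice"],
}
vocab = sorted({w for toks in class_tokens.values() for w in toks})

vector = features(["nice", "game", "today"], class_tokens, vocab)
print(len(vector), vector)
```

Under these assumptions each comment maps to a vector with one entropy value and one divergence per class, so its length depends on the number of classes and not on the vocabulary size. This illustrates how such a representation can be far smaller than a TF-IDF or unigram vector; a text classifier would then be trained on these vectors.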
