Handling Imbalance Issue in Hate Speech Classification using Sampling-based Methods

In text classification, class imbalance in the dataset is often disregarded even though it can significantly affect the performance of the classification model. The issue also arises in hate speech detection, where most collected datasets are highly imbalanced. Among state-of-the-art methods for classifying imbalanced data, sampling-based techniques are the most effective approach. In this paper, four resampling methods, namely Random Oversampling (ROS), Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic sampling (ADASYN) and Random Undersampling (RUS), are used to address the unequal class distribution in a hate speech dataset. With three basic machine learning classifiers, i.e. Support Vector Machine, Logistic Regression and Naïve Bayes, the evaluation results show that the oversampling approach improves the accuracy and the overall performance of all three classifiers. Among all combinations of resampling techniques and machine learning algorithms, Logistic Regression combined with ROS performed best, with an overall accuracy of 91 percent and an F1-score of 0.95.
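As an illustration of the two simplest resampling methods named in the abstract, the following is a minimal sketch of ROS and RUS in plain Python. It is not the authors' implementation: the function names, the `seed` parameter, and the toy data are assumptions made for this example, and the balancing targets (largest class for ROS, smallest class for RUS) are one common choice rather than something the abstract specifies.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """ROS sketch: duplicate randomly chosen examples of each smaller class
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Sample with replacement; the majority class gets zero extras.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


def random_undersample(texts, labels, seed=0):
    """RUS sketch: randomly keep only as many examples of each class
    as the smallest class contains."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_texts, out_labels = [], []
    for cls in counts:
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Sample without replacement, so no kept example is duplicated.
        out_texts.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_texts, out_labels


# Hypothetical toy data: five majority-class items, one minority-class item.
texts = ["a", "b", "c", "d", "e", "f"]
labels = ["neutral"] * 5 + ["hate"]

ros_texts, ros_labels = random_oversample(texts, labels)
rus_texts, rus_labels = random_undersample(texts, labels)
print(Counter(ros_labels))
print(Counter(rus_labels))
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution. SMOTE and ADASYN differ from this sketch in that they synthesize new minority examples in feature space rather than duplicating existing ones.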
