Automatic detection of abusive South African tweets using a semi-supervised learning approach

The major obstacles to detecting abusive South African tweets are the scarcity of annotated corpora and the high cost of annotation, both of which semi-supervised learning can address. Semi-supervised learning techniques enhance the training data by combining labelled and unlabelled data. However, existing approaches skew the classification of unlabelled data towards the labelled data, despite the class imbalance of the labelled data and the mismatch in feature distribution between labelled and test data, both of which are common in abusive texts. This paper presents a reliable semi-supervised learning approach that reduces noise in the training data by combining the features of unlabelled data with varying numbers of important features selected from the labelled data. The chi-square statistic is used for feature selection, while the k-means algorithm is used to cluster the data points. Reliable labels are then assigned to the data points by a majority voting rule. Classification experiments with Support Vector Machine and Logistic Regression classifiers show that the proposed approach improves prediction performance.
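The labelling pipeline summarised above (chi-square feature selection, k-means clustering, majority voting) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the binary bag-of-words toy data, the feature-subset sizes, the seeding of k-means centroids from the labelled class means, and the rule that a cluster takes the majority label of its labelled members are all assumptions made for illustration only.

```python
from collections import Counter

def chi2_score(X, y, j):
    """Chi-square statistic of binary feature j against a binary label (2x2 table)."""
    n = len(y)
    a = sum(1 for row, lab in zip(X, y) if row[j] and lab == 1)
    b = sum(1 for row, lab in zip(X, y) if row[j] and lab == 0)
    c = sum(1 for row, lab in zip(X, y) if not row[j] and lab == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def top_features(X, y, k):
    """Indices of the k features with the highest chi-square score."""
    scores = [(chi2_score(X, y, j), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, key=lambda s: (-s[0], s[1]))[:k]]

def project(X, cols):
    return [[row[j] for j in cols] for row in X]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations from the given starting centroids."""
    for _ in range(iters):
        assign = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        new = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            new.append(mean(members) if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return assign

def pseudo_label(X_lab, y_lab, X_unl, sizes):
    """One clustering per feature-subset size, then a majority vote per tweet."""
    votes = [[] for _ in X_unl]
    classes = sorted(set(y_lab))
    for k in sizes:
        cols = top_features(X_lab, y_lab, k)
        lab, unl = project(X_lab, cols), project(X_unl, cols)
        # Assumption: seed one centroid per class from the labelled class means.
        seeds = [mean([p for p, t in zip(lab, y_lab) if t == c]) for c in classes]
        assign = kmeans(lab + unl, seeds)
        for c in range(len(seeds)):
            lab_in_c = [t for t, a in zip(y_lab, assign[:len(lab)]) if a == c]
            if not lab_in_c:
                continue  # a cluster with no labelled member casts no vote
            # Assumption: a cluster takes the majority label of its labelled members.
            cluster_label = Counter(lab_in_c).most_common(1)[0][0]
            for i, a in enumerate(assign[len(lab):]):
                if a == c:
                    votes[i].append(cluster_label)
    return [Counter(v).most_common(1)[0][0] if v else None for v in votes]

# Hypothetical toy data: rows are tweets, columns are binary term indicators,
# label 1 = abusive, 0 = not abusive.
X_lab = [[1, 1, 0, 1], [1, 0, 1, 0], [1, 1, 1, 0],
         [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 0]]
y_lab = [1, 1, 1, 0, 0, 0]
X_unl = [[1, 1, 0, 0], [0, 0, 1, 1]]

labels = pseudo_label(X_lab, y_lab, X_unl, sizes=(1, 2, 3))
print(labels)  # [1, 0]
```

In a full pipeline, the pseudo-labelled tweets would presumably be appended to the labelled set before training the Support Vector Machine or Logistic Regression classifier. Using an odd number of subset sizes avoids tied votes in the two-class case.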