Two major obstacles to detecting abusive South African tweets are the scarcity of annotated corpora and the high cost of annotation, both of which semi-supervised learning can mitigate. Semi-supervised learning techniques enlarge the training data by combining labelled and unlabelled data. However, existing approaches skew the classification of unlabelled data towards the labelled data, even though the labelled data are class-imbalanced and their feature distribution often differs from that of the test data, as is common in abusive texts. This paper presents a reliable semi-supervised learning approach that reduces noise in the training data by combining the features of unlabelled data with varying numbers of important features from the labelled data. The chi-square statistic is used for feature selection, and the k-means algorithm is used to cluster the data points. Reliable labels are then assigned to the data points by majority voting. Classification experiments with Support Vector Machine and Logistic Regression classifiers show that the proposed approach improves prediction performance.
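The pipeline summarised above (chi-square feature selection, k-means clustering, majority voting) could be sketched roughly as follows. This is an illustrative reading of the abstract, not the paper's implementation: the function name `cluster_then_label`, the parameters `feature_sizes` and `n_clusters`, the toy tweets, and the choice to build the vocabulary from the labelled texts only are all assumptions made for the example.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def cluster_then_label(labelled_texts, labels, unlabelled_texts,
                       feature_sizes=(2, 4), n_clusters=2, seed=0):
    """Hypothetical helper: return one pseudo-label per unlabelled text.

    An entry is None when no cluster containing labelled points ever
    included that text.
    """
    labels = np.asarray(labels)
    # Assumption: the vocabulary comes from the labelled texts only.
    vectorizer = CountVectorizer().fit(labelled_texts)
    X_lab = vectorizer.transform(labelled_texts)
    X_unl = vectorizer.transform(unlabelled_texts)

    votes = [[] for _ in unlabelled_texts]
    for k in feature_sizes:
        k = min(k, X_lab.shape[1])  # never ask for more features than exist
        # Keep the k features with the highest chi-square score on labelled data.
        selector = SelectKBest(chi2, k=k).fit(X_lab, labels)
        Z_all = np.vstack([selector.transform(X_lab).toarray(),
                           selector.transform(X_unl).toarray()])
        # Cluster labelled and unlabelled points together.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Z_all)
        lab_clusters = km.labels_[:len(labels)]
        unl_clusters = km.labels_[len(labels):]
        for c in range(n_clusters):
            members = labels[lab_clusters == c]
            if len(members) == 0:
                continue  # no labelled point in this cluster, so no vote
            # Majority label among the labelled members of the cluster.
            cluster_label = Counter(members.tolist()).most_common(1)[0][0]
            for i in np.flatnonzero(unl_clusters == c):
                votes[i].append(cluster_label)

    # Majority vote across feature sizes; ties go to the first-seen label.
    return [Counter(v).most_common(1)[0][0] if v else None for v in votes]


# Toy data (invented for illustration): 1 = abusive, 0 = not abusive.
labelled = ["you are stupid idiot", "stupid idiot go away",
            "have a lovely day", "lovely day my friend"]
y = [1, 1, 0, 0]
unlabelled = ["stupid idiot people", "what a lovely day"]
pseudo_labels = cluster_then_label(labelled, y, unlabelled)
print(pseudo_labels)
```

The pseudo-labelled texts can then be appended to the labelled set before training a classifier such as scikit-learn's `LinearSVC` or `LogisticRegression`. Voting across several feature sizes is one way to discard labels that depend on a single feature subset.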