Reducing the size of training datasets in the classification of online discussions

Supervised machine learning models have been widely used to classify messages in online discussions. Supervised learning algorithms require a large set of annotated data to build an accurate predictive model. However, data annotation is a complex task for three reasons: (i) it depends on specialists to label the data accurately; (ii) it is often time-consuming and labour-intensive; and (iii) in educational settings, it is not always easy to collect the substantial volume of data that machine learning algorithms require. This paper presents an active learning-based approach that can reduce the amount of annotated data required to build machine learning models for the classification of educational data. The results show that, with only 20% of the annotated data, the proposed approach achieved results similar to those reported in previous works that used the complete databases to train the machine learning model.
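
The general idea of active learning with a fixed annotation budget can be illustrated with a minimal sketch. The abstract does not state which query strategy or classifier is used, so everything below is an assumption made for illustration: the sketch uses pool-based uncertainty sampling, a toy nearest-centroid classifier, synthetic two-dimensional data in place of message features, and hypothetical names (`NearestCentroid`, `active_learning`, `oracle`, `budget_fraction`). The `oracle` function stands in for the human annotator.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


class NearestCentroid:
    """Toy classifier for illustration only (an assumption, not the paper's model)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = (
                sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts),
            )
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))

    def margin(self, x):
        # Gap between the two nearest centroids; a small gap means an uncertain sample.
        d = sorted(sq_dist(x, c) for c in self.centroids.values())
        return d[1] - d[0]


def active_learning(X_pool, oracle, budget_fraction=0.2, n_seed=4, seed=0):
    """Label only a fraction of the pool, always querying the most uncertain sample."""
    rng = random.Random(seed)
    budget = round(len(X_pool) * budget_fraction)
    # Start from a few randomly chosen labelled samples.
    labels = {i: oracle(i) for i in rng.sample(range(len(X_pool)), n_seed)}
    model = NearestCentroid()
    while len(labels) < budget:
        unlabelled = [i for i in range(len(X_pool)) if i not in labels]
        if len(set(labels.values())) < 2:
            # Uncertainty is undefined until two classes are known: query at random.
            query = rng.choice(unlabelled)
        else:
            model.fit([X_pool[i] for i in labels], list(labels.values()))
            query = min(unlabelled, key=lambda i: model.margin(X_pool[i]))
        labels[query] = oracle(query)  # ask the annotator for one more label
    model.fit([X_pool[i] for i in labels], list(labels.values()))
    return model, labels


# Synthetic pool: two Gaussian clusters standing in for two message classes.
data_rng = random.Random(42)
y_true = [i % 2 for i in range(200)]
X_pool = [(data_rng.gauss(4 * c, 1), data_rng.gauss(4 * c, 1)) for c in y_true]

model, labels = active_learning(X_pool, oracle=lambda i: y_true[i])
accuracy = sum(model.predict(x) == t for x, t in zip(X_pool, y_true)) / len(X_pool)
print(f"labelled {len(labels)} of {len(X_pool)} samples, pool accuracy {accuracy:.2f}")
```

With `budget_fraction=0.2`, the loop stops once 20% of the pool has been labelled, which mirrors the proportion reported in the abstract. The accuracy on this synthetic data says nothing about the paper's results.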