A robust random forest-based tri-training algorithm for early in-trouble student prediction

Educational data mining has attracted worldwide attention because of its significance in the education domain. Among its many tasks, early in-trouble student prediction is a popular one: it aims to identify students who are at risk in their studies as early as possible, before the end of the permitted period of study. Early detection faces a data shortage at both the instance level and the set level. At the instance level, only incomplete data can be gathered for each student during the early period of study; at the set level, few labeled data can be collected on students' final study status. A solution that works in such a context is therefore required. In this paper, we propose a robust random forest-based Tri-training algorithm that overcomes this data shortage. Specifically, an incomplete data handling method is integrated into the iterative semi-supervised learning process of the original Tri-training algorithm, making the algorithm more robust. In addition, a new combination of the Tri-training algorithm with a random forest model is examined, so that each classifier in the Tri-training model is strengthened and gives more accurate predictions. Experimental results on real data sets confirm the effectiveness of the proposed algorithm for early in-trouble student prediction: it achieves better results than existing methods that use the preprocessing approach.
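The general idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose details the abstract does not give. The sketch assumes mean imputation as the incomplete data handling step, re-estimated inside every Tri-training round; it uses a small bagged ensemble of one-split decision stumps as a stand-in for a real random forest; and it omits the error-rate conditions that the original Tri-training algorithm applies before accepting pseudo-labels. All function names, feature names, and data values below are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def feature_means(rows):
    """Per-feature mean over observed (non-None) values; 0.0 if none observed."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(mean(seen) if seen else 0.0)
    return means

def fill(row, means):
    """Mean imputation (an assumption): replace each None with the feature mean."""
    return [m if v is None else v for v, m in zip(row, means)]

def fit_stump(X, y, rng):
    """One-split tree on a randomly chosen feature (stand-in for a forest tree)."""
    j = rng.randrange(len(X[0]))
    best = None
    for t in sorted({x[j] for x in X}):
        left = [lab for x, lab in zip(X, y) if x[j] <= t]
        right = [lab for x, lab in zip(X, y) if x[j] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, j, t, l_lab, r_lab)
    return best[1:]

def fit_forest(X, y, rng, n_trees=15):
    """Bagging: every stump is fitted on its own bootstrap sample of (X, y)."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict(forest, x):
    """Majority vote over the stumps of one forest."""
    votes = Counter(l if x[j] <= t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]

def fit_member(X_raw, y, rng):
    """Impute, then train; returns (forest, means used for imputation)."""
    means = feature_means(X_raw)
    return fit_forest([fill(x, means) for x in X_raw], y, rng), means

def tri_train(X_lab, y_lab, X_unlab, rounds=3, seed=0):
    rng = random.Random(seed)
    members = []
    for _ in range(3):  # three classifiers, each started from a bootstrap sample
        idx = [rng.randrange(len(X_lab)) for _ in X_lab]
        members.append(fit_member([X_lab[i] for i in idx], [y_lab[i] for i in idx], rng))
    for _ in range(rounds):
        updated = []
        for i in range(3):
            (fj, mj), (fk, mk) = [members[m] for m in range(3) if m != i]
            Xi, yi = list(X_lab), list(y_lab)
            for x in X_unlab:
                pj, pk = predict(fj, fill(x, mj)), predict(fk, fill(x, mk))
                if pj == pk:  # both peers agree, so pseudo-label x for member i
                    Xi.append(x)
                    yi.append(pj)
            # Imputation is redone here, inside the loop, on the enlarged set.
            updated.append(fit_member(Xi, yi, rng))
        members = updated
    def classify(x):
        votes = Counter(predict(f, fill(x, m)) for f, m in members)
        return votes.most_common(1)[0][0]
    return classify

# Toy data: [average_score, credits_earned]; None marks a value not yet available.
X_lab = [[2.0, 8], [3.5, None], [1.5, 10], [4.0, 12],
         [8.0, 26], [None, 30], [9.0, 28], [7.5, 24]]
y_lab = ["at_risk"] * 4 + ["ok"] * 4
X_unlab = [[2.5, 9], [3.0, None], [8.5, 27], [None, 29], [1.0, 11], [9.5, 25]]
classify = tri_train(X_lab, y_lab, X_unlab)
print(classify([0.5, 6]), classify([9.8, 32]))
```

Because imputation runs inside each round rather than once beforehand, the fill-in values are re-estimated as pseudo-labeled students join a classifier's training set. This is one possible reading of integrating incomplete data handling into the iterative mechanism, in contrast to a one-off preprocessing step.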
