Comparing Different Resampling Methods in Predicting Students’ Performance Using Machine Learning Techniques

Predicting students’ performance has become one of the most useful and important research topics in education, and data mining is particularly helpful for analyzing it. The task remains challenging because datasets in this field are typically imbalanced, and no systematic comparison of resampling methods for this problem is available. This paper compares several resampling techniques, namely Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek, for handling imbalanced data when predicting students’ performance on two different datasets. It also examines the difference between multiclass and binary classification and the effect of feature structure. To assess how well each resampling method addresses the imbalance problem, the paper uses a range of machine learning classifiers: Random Forest, K-Nearest Neighbor, Artificial Neural Network, XGBoost, Support Vector Machine (Radial Basis Function kernel), Decision Tree, Logistic Regression, and Naïve Bayes. Random hold-out and shuffle 5-fold cross-validation are used as model validation techniques. Results across several evaluation metrics indicate that models perform better with fewer classes and with nominal features. Classifiers perform poorly on imbalanced data and improve when balanced datasets are used, so addressing the imbalance is necessary. The Friedman test, a statistical significance test, confirms that SVM-SMOTE is more effective than the other resampling methods. The Random Forest classifier combined with SVM-SMOTE achieves the best result among all models.
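The Friedman test mentioned in the abstract ranks the competing methods within each evaluation block (for example, one classifier on one dataset) and checks whether the mean ranks differ more than chance would allow. The following is a minimal sketch of that computation, not the authors' code. The method names and scores are hypothetical placeholders rather than results from the paper, and the statistic is the basic form without a tie correction.

```python
from typing import Dict, List, Tuple


def average_ranks(scores: List[float]) -> List[float]:
    """Rank scores within one block: rank 1 is the best (highest) score.

    Tied scores share the mean of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for idx in range(pos, end + 1):
            ranks[order[idx]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(table: Dict[str, List[float]]) -> Tuple[float, Dict[str, float]]:
    """Return the Friedman chi-square statistic and the mean rank per method.

    `table` maps each method name to its scores, one score per block;
    all lists must have the same length.
    """
    methods = list(table)
    k = len(methods)            # number of methods being compared
    n = len(table[methods[0]])  # number of blocks
    rank_sums = [0.0] * k
    for block in range(n):
        row = [table[m][block] for m in methods]
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    mean_ranks = {m: rank_sums[j] / n for j, m in enumerate(methods)}
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r ** 2 for r in mean_ranks.values()) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks


# Hypothetical scores (illustrative only): three methods, four blocks.
toy_scores = {
    "method_a": [0.90, 0.80, 0.70, 0.95],
    "method_b": [0.80, 0.70, 0.60, 0.90],
    "method_c": [0.70, 0.60, 0.50, 0.85],
}
chi2, mean_ranks = friedman_statistic(toy_scores)
print(chi2, mean_ranks)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of methods. A lower mean rank indicates a method that scored better more consistently across blocks.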
