Massive LMS log data analysis for the early prediction of course-agnostic student performance

Abstract

The early prediction of students’ performance is a valuable resource for improving their learning. If at-risk students can be detected in the initial stages of a course, there is more time to improve their performance. Likewise, excellent students can be motivated with customized additional activities. For this reason, a number of research works aim at the early detection of students’ performance. Some of them analyze LMS log files, which store information about student interaction with the LMS. Many works create predictive models with the log files generated for the whole course, but those models are not useful for early prediction because the log information available at prediction time differs from the information used to train the models. Other works do create predictive models with the log information retrieved at the early stages of courses, but they focus on a particular type of course. In this work, we use machine learning to create models for the early prediction of students’ performance in solving LMS assignments, by analyzing only the LMS log files generated up to the moment of prediction. Moreover, our models are course agnostic, because the datasets are created with all the University of UniversityName courses for one academic year. We predict students’ performance at 10%, 25%, 33% and 50% of the course length. Our objective is not to predict the student’s exact mark in LMS assignments, but to detect at-risk, fail and excellent students in the early stages of the course. That is why we create different classification models for each of those three student groups. Decision tree, naive Bayes, logistic regression, multilayer perceptron (MLP) neural network, and support vector machine models are created and evaluated. The accuracy of all the models grows as the moment of prediction moves later in the course. Although all the algorithms but naive Bayes show accuracy differences lower than 5%, MLP obtains the best performance: from 80.1% accuracy when 10% of the course has been delivered to 90.1% when half of it has taken place. We also discuss the LMS log entries that most influence students’ performance. Using a clustering algorithm, we detect six different clusters of students according to their interaction with the LMS. Analyzing the interaction patterns of each cluster, we find that those patterns are repeated in all the early stages of the course. Finally, we show that four of those six student-LMS interaction patterns have a strong correlation with students’ performance.
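The central constraint described above, that a model may only see the log entries generated up to the moment of prediction, can be illustrated with a minimal sketch. The log schema used here (tuples of student identifier, event type and timestamp), the function name `early_features` and the sample data are assumptions made for illustration only; the abstract does not specify how the logs are stored or which features are extracted.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Moments of prediction named in the abstract, as fractions of course length.
PREDICTION_MOMENTS = (0.10, 0.25, 0.33, 0.50)


def early_features(log_entries, course_start, course_end, fraction):
    """Count LMS events per student and event type up to a moment of prediction.

    log_entries: iterable of (student_id, event_type, timestamp) tuples.
                 This simplified schema is hypothetical, not the paper's.
    fraction:    share of the course length already delivered, e.g. 0.25.
    """
    # Entries after the cutoff are unknown at prediction time, so they are dropped.
    cutoff = course_start + (course_end - course_start) * fraction
    counts = defaultdict(Counter)
    for student_id, event_type, timestamp in log_entries:
        if course_start <= timestamp <= cutoff:
            counts[student_id][event_type] += 1
    return {student: dict(events) for student, events in counts.items()}


if __name__ == "__main__":
    # Hypothetical 100-day course with a handful of made-up log entries.
    start, end = datetime(2021, 1, 1), datetime(2021, 4, 11)
    log = [
        ("s1", "resource_view", datetime(2021, 1, 3)),
        ("s1", "forum_post", datetime(2021, 1, 20)),
        ("s2", "resource_view", datetime(2021, 1, 5)),
        ("s2", "resource_view", datetime(2021, 3, 1)),
    ]
    for moment in PREDICTION_MOMENTS:
        print(moment, early_features(log, start, end, moment))
```

Building one feature table per moment of prediction in this way keeps the training data consistent with what is actually available when the prediction is made, which is the mismatch the abstract attributes to models trained on whole-course logs.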
