Log-Data Clustering Analysis for Dropout Prediction in Beginner Programming Classes

Abstract

Educational data mining (EDM) applies data mining, machine learning, and statistics to data generated in educational settings. In most school education, one teacher teaches many students, and periodic examinations are used to confirm that students have acquired the intended skills. However, because examinations cannot be held easily, it is difficult to grasp each student’s status from lesson to lesson. In programming classes, on the other hand, each student’s history of UNIX commands and source-code edits can be stored easily and automatically as log-data. Attempts have therefore been made to estimate student performance from this log-data, but their estimation accuracy is not high. In this research, rather than estimating performance from the log-data, we aim to identify the students who cannot keep up with the programming lessons. Specifically, we propose a method that predicts dropouts by clustering the log-data with unsupervised learning and detecting outliers.
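To make the idea of clustering-based outlier detection concrete, the following minimal sketch groups students by simple per-lesson log features and flags those who fall into an unusually small cluster. It is an illustration under stated assumptions, not the method evaluated in this work: the features (command count, edit count), the sample values, the use of k-means, the deterministic initialization, and the names `kmeans`, `flag_outliers`, and `min_share` are all hypothetical.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def flag_outliers(points, k=2, min_share=0.25):
    """Flag points whose cluster holds less than min_share of all points."""
    labels, _ = kmeans(points, k)
    sizes = {c: labels.count(c) for c in set(labels)}
    return [sizes[lab] / len(points) < min_share for lab in labels]


# Hypothetical per-student features for one lesson:
# (number of UNIX commands issued, number of source-code edits).
students = [(120, 40), (110, 35), (130, 45), (125, 38), (115, 42), (5, 1)]

if __name__ == "__main__":
    for features, is_outlier in zip(students, flag_outliers(students)):
        print(features, "possible dropout" if is_outlier else "on track")
```

In this sketch a student is treated as a dropout candidate purely because few other students behave similarly. No performance labels are needed, which is the sense in which the approach is unsupervised.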