Applying Data Science Methods for Early Prediction of Undergraduate Student Retention

This paper presents a case study of applying data science methods to a large educational dataset collected at a university over seven years. The goal of the study is to identify important features and to derive models for predicting student retention. Issues in dealing with real-world data, such as variable definition, missing data handling, and data cleaning, are discussed. A new feature selection method based on recursive feature elimination is developed. The study derives features and models for four student groups: first-generation students, African American students, Hispanic students, and disabled students. The identified features and the resulting predictive models are compared across the four groups.