Missing Data Imputation Method Comparison in Ohio University Student Retention Database

Ohio University has been conducting research on first-year student retention to prevent dropouts (OU Office of Institutional Research, First-Year Students Retention, 2008). However, more than 20% of the values in the data set are missing. Missing data limits how well results generalize to the target population. This study classifies the missing data as one of three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Once the type of missingness is identified, appropriate methods for handling it are discussed. Five methods were used in the research: mean, median, zero, hot-deck, and multiple imputation. Despite performing poorly on the accuracy comparison test, multiple and hot-deck imputation improved the retention prediction rate. Mean and median imputation are more accurate and are sufficient for the prediction model.
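To make the simpler methods concrete, the following is a minimal Python sketch of mean, median, zero, and random hot-deck imputation on a single numeric column. It is an illustration only, not the procedure used in the study: the function names, the `gpa` column, and its values are hypothetical, and the hot-deck variant shown draws donors at random from all observed values rather than from matched donor classes. Multiple imputation is left out because it requires an imputation model and a pooling step that do not fit in a short sketch.

```python
import random
import statistics


def observed_values(values):
    """Return the non-missing entries (missing is encoded as None)."""
    return [v for v in values if v is not None]


def impute_constant(values, fill):
    """Replace every missing entry with one fixed fill value."""
    return [fill if v is None else v for v in values]


def impute_mean(values):
    """Mean imputation: fill gaps with the mean of the observed values."""
    return impute_constant(values, statistics.mean(observed_values(values)))


def impute_median(values):
    """Median imputation: fill gaps with the median of the observed values."""
    return impute_constant(values, statistics.median(observed_values(values)))


def impute_zero(values):
    """Zero imputation: fill gaps with 0."""
    return impute_constant(values, 0)


def impute_hot_deck(values, rng):
    """Random hot-deck: fill each gap with a randomly chosen observed value."""
    donors = observed_values(values)
    return [rng.choice(donors) if v is None else v for v in values]


# Hypothetical column with two missing entries.
gpa = [3.0, None, 2.0, 4.0, None, 3.5]
rng = random.Random(0)  # seeded so the hot-deck draw is repeatable

mean_filled = impute_mean(gpa)
median_filled = impute_median(gpa)
zero_filled = impute_zero(gpa)
hot_deck_filled = impute_hot_deck(gpa, rng)
```

Mean, median, and zero imputation give every missing entry the same value, which shrinks the variance of the column; hot-deck imputation reuses observed values, so the filled column keeps a spread closer to that of the observed data.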
