The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data

OBJECTIVE: To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research.

DATA SOURCES/STUDY SETTING: HCUP SID.

STUDY DESIGN: A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation.

PRINCIPAL FINDINGS: Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios.

CONCLUSIONS: Conditional MI substantially improved statistical inferences for racial health disparities research with the SID.
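To make the two simplest methods in the comparison concrete, the following is a minimal Python sketch of random-draw and hot-deck imputation for a single categorical variable. It is an illustration under stated assumptions, not the study's implementation: the record layout, the field names `hospital` and `race`, the category labels, and the choice of hospital as the hot-deck cell are all hypothetical, and neither of the MI methods is shown.

```python
import random
from collections import defaultdict


def random_draw_impute(records, field, rng):
    """Fill missing values by sampling from the pooled observed values of `field`."""
    observed = [r[field] for r in records if r[field] is not None]
    imputed = []
    for r in records:
        if r[field] is None:
            r = dict(r, **{field: rng.choice(observed)})
        imputed.append(r)
    return imputed


def hot_deck_impute(records, field, cell_key, rng):
    """Fill missing values by sampling a donor value from the same cell.

    `cell_key` names the grouping variable (assumed here to be the hospital).
    If a cell has no donors, fall back to the pooled observed values.
    """
    donors = defaultdict(list)
    for r in records:
        if r[field] is not None:
            donors[r[cell_key]].append(r[field])
    pooled = [v for values in donors.values() for v in values]

    imputed = []
    for r in records:
        if r[field] is None:
            pool = donors.get(r[cell_key]) or pooled
            r = dict(r, **{field: rng.choice(pool)})
        imputed.append(r)
    return imputed


# Toy records (hypothetical): two hospitals with different category mixes.
toy = [
    {"hospital": "A", "race": "X"},
    {"hospital": "A", "race": None},
    {"hospital": "B", "race": "Y"},
    {"hospital": "B", "race": None},
]
rng = random.Random(0)
by_hot_deck = hot_deck_impute(toy, "race", "hospital", rng)
by_random_draw = random_draw_impute(toy, "race", rng)
```

In this toy setup the hot deck borrows only from donors in the same hospital, so it preserves between-hospital differences that a pooled random draw ignores. That is one way to read the abstract's emphasis on retaining the hierarchical structure of the SID.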
