RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, because race and ethnicity are powerful confounders for many health exposures and treatment outcomes, and are closely linked to population-specific genetic variation. We showed that deep neural networks estimate missing racial and ethnic information more accurately than competing methods (logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance on every metric considered: accuracy, cross-entropy loss (error), precision, recall, and area under the receiver operating characteristic curve (all p < 10⁻⁹). We made specific efforts to interpret the trained neural network models in order to identify, quantify, and visualize medical features that are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health care, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation that predispose to diseases.
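As a rough illustration of how a multiclass imputation model can be scored on several of the metrics named above, the following sketch computes accuracy, mean cross-entropy loss, and per-class precision and recall from predicted class probabilities. It is not the evaluation code used for RIDDLE; the function name `evaluate`, the placeholder class names, and the toy probabilities are all assumptions made purely for illustration.

```python
import math

def evaluate(y_true, probs, classes):
    """Score multiclass predictions (hypothetical helper, illustration only).

    y_true  : list of true class labels
    probs   : list of predicted probability vectors, one per sample,
              ordered like `classes`
    classes : list of class labels
    """
    n = len(y_true)
    # Predicted label = class with the highest predicted probability.
    y_pred = [classes[max(range(len(classes)), key=p.__getitem__)] for p in probs]

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Cross-entropy: mean negative log-probability assigned to the true class.
    eps = 1e-12  # guards against log(0)
    cross_entropy = -sum(
        math.log(max(p[classes.index(t)], eps)) for t, p in zip(y_true, probs)
    ) / n

    precision, recall = {}, {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precision[c] = tp / predicted if predicted else 0.0
        recall[c] = tp / actual if actual else 0.0
    return accuracy, cross_entropy, precision, recall

# Toy data with placeholder class names (not real results).
classes = ["group_a", "group_b", "group_c"]
y_true = ["group_a", "group_b", "group_c", "group_a"]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.6, 0.3, 0.1],
]
accuracy, cross_entropy, precision, recall = evaluate(y_true, probs, classes)
print(accuracy, round(cross_entropy, 4), precision, recall)
```

Cross-entropy penalizes confident wrong predictions more heavily than accuracy does, which is one reason to report both when comparing classifiers.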
