Integration of genetic and clinical information to improve imputation of data missing from electronic health records

OBJECTIVE Clinical data on patients' measurements and treatment history stored in electronic health record (EHR) systems are starting to be mined for better treatment options and disease associations. A primary challenge in using EHR data is the considerable amount of missing data, and failure to address it can introduce significant bias into EHR-based research. Current imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation.

MATERIALS AND METHODS We used the associations of individual single nucleotide polymorphisms (SNPs) with phenotype variables in the EHR as input to construct a genetic risk score that quantifies the genetic contribution to each phenotype. Multiple approaches to constructing the genetic risk score were evaluated for optimal performance. The genetic risk score, along with phenotype correlation, was then used as a predictor to impute the missing values.

RESULTS To demonstrate the method's performance, we applied our model to impute missing cardiovascular-related measurements, including low-density lipoprotein, heart failure, and aortic aneurysm disease, in the Electronic Medical Records and Genomics (eMERGE) data. The integrated method increased the area under the curve of the imputation for binary phenotypes and decreased the root-mean-square error for continuous phenotypes.

CONCLUSION Compared with standard imputation approaches, incorporating genetic information is a novel approach that uses more of the EHR data and achieves better performance in missing data imputation.
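The idea described under MATERIALS AND METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one common form of genetic risk score (a sum of allele dosages weighted by per-SNP effect sizes), uses ordinary least squares as a stand-in imputation model for a continuous phenotype, and runs on synthetic data. The function names `genetic_risk_score`, `impute_missing`, and `rmse` are hypothetical.

```python
import numpy as np


def genetic_risk_score(dosages, betas):
    """Weighted allele count: rows are patients, columns are SNP dosages (0/1/2).

    `betas` are per-SNP effect sizes, assumed here to come from SNP-phenotype
    association tests; other constructions of the score are possible.
    """
    return np.asarray(dosages, dtype=float) @ np.asarray(betas, dtype=float)


def impute_missing(y, predictors):
    """Fill NaN entries of y with least-squares predictions from `predictors`.

    The model is fit only on rows where y is observed; observed values are
    returned unchanged.
    """
    y = np.asarray(y, dtype=float)
    predictors = np.asarray(predictors, dtype=float)
    design = np.column_stack([np.ones(len(y)), predictors])  # add intercept
    observed = ~np.isnan(y)
    coef, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    filled = y.copy()
    filled[~observed] = design[~observed] @ coef
    return filled


def rmse(a, b):
    """Root-mean-square error between two equal-length arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic illustration (all values below are made up for the demo).
rng = np.random.default_rng(0)
n_patients, n_snps = 500, 20
dosages = rng.binomial(2, 0.3, size=(n_patients, n_snps))
betas = rng.normal(0.0, 0.5, size=n_snps)
grs = genetic_risk_score(dosages, betas)

correlated_phenotype = rng.normal(size=n_patients)  # e.g. another lab value
y_true = 1.0 + 0.8 * correlated_phenotype + grs + rng.normal(0.0, 0.5, n_patients)

# Hide 20% of the phenotype values to mimic missing EHR entries.
missing = rng.random(n_patients) < 0.2
y_observed = np.where(missing, np.nan, y_true)

# Impute with phenotype correlation only, then with the genetic score added.
baseline = impute_missing(y_observed, correlated_phenotype[:, None])
with_grs = impute_missing(
    y_observed, np.column_stack([correlated_phenotype, grs])
)

print("RMSE, phenotype only:     ", rmse(baseline[missing], y_true[missing]))
print("RMSE, phenotype + GRS:    ", rmse(with_grs[missing], y_true[missing]))
```

In this toy setup the phenotype is simulated to depend on the score, so adding it as a predictor lowers the error by construction; the sketch only shows where the score enters the imputation model. For a binary phenotype, a logistic model evaluated by area under the curve would take the place of least squares and root-mean-square error.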
