Chapter 13: Mining Electronic Health Records in the Genomics Era

Abstract: The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that can not only improve health care but also serve as a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in using EHR data include data availability, missing data, incorrect data, and vast quantities of unstructured narrative text. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, extracting the large cohorts necessary for genome-wide association studies may require natural language processing methods to handle narrative text data. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
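To make the idea of combining structured and unstructured data concrete, the following is a minimal Python sketch of a case/control rule that pairs billing codes with affirmed free-text mentions. It is an illustration under stated assumptions, not an algorithm taken from the chapter: the `Patient` record, the code prefix, the search phrase, the negation cue list, and the `min_codes` threshold are all hypothetical, and the negation check is a crude same-sentence heuristic rather than a full natural language processing pipeline.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Patient:
    """Hypothetical minimal record: billing codes plus narrative notes."""
    patient_id: str
    billing_codes: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


# Illustrative assumptions only: one code prefix and one search phrase.
CASE_CODE_PREFIXES = ("714.0",)
MENTION = re.compile(r"\brheumatoid arthritis\b", re.IGNORECASE)
# Crude negation cues; a real system would use a dedicated NLP component.
NEGATION_CUE = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)


def affirmed_mentions(note: str) -> int:
    """Count mentions that have no negation cue earlier in the same sentence."""
    count = 0
    for sentence in re.split(r"[.\n]", note):
        for match in MENTION.finditer(sentence):
            if not NEGATION_CUE.search(sentence[:match.start()]):
                count += 1
    return count


def classify(patient: Patient, min_codes: int = 2) -> str:
    """Return 'case', 'control', or 'review' (send to manual chart review)."""
    n_codes = sum(code.startswith(CASE_CODE_PREFIXES) for code in patient.billing_codes)
    n_mentions = sum(affirmed_mentions(note) for note in patient.notes)
    if n_codes >= min_codes and n_mentions >= 1:
        return "case"      # structured and narrative evidence agree
    if n_codes == 0 and n_mentions == 0:
        return "control"   # no evidence of the condition in either source
    return "review"        # conflicting or partial evidence


if __name__ == "__main__":
    cohort = [
        Patient("a", ["714.0", "714.0"], ["Patient has rheumatoid arthritis."]),
        Patient("b", [], ["No history of rheumatoid arthritis."]),
        Patient("c", ["714.0"], []),
    ]
    for p in cohort:
        print(p.patient_id, classify(p))
```

Requiring agreement between both data sources for cases, and absence from both for controls, trades cohort size for validity; patients with partial evidence fall into a review bucket, mirroring the manual chart review step described above.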
