Validating gene-phenotype associations using relationships in the UMLS

Objective: Large-scale next-generation sequencing of population cohorts paired with patients’ electronic health records (EHR) provides an excellent resource for the study of gene-disease associations. To validate those associations, researchers often consult databases that identify relationships between genes of interest and relevant disease phenotypes, which we refer to simply as “phenotypes”. However, most of these databases contain phenotypes that are not suited for automated analysis of EHR data, which often captures these phenotypes in the form of International Classification of Diseases (ICD) codes. There is a need for a resource that comprehensively provides gene-phenotype mappings in a format that can be used to evaluate phenotypes from EHR.

Methods: We built a directed graph database of genes, medical concepts and ICD codes based on a subset of the National Library of Medicine’s Unified Medical Language System (UMLS) and other resources. To obtain associations between genes and ICD codes, we traversed the defined relationships from gene, variant and disease concepts to ICD codes, resulting in a set of mappings that link specific genes and variants to these ICD codes.

Results: Our method created 249,764 mappings between genes and ICD codes, comprising 27,226 “disease” phenotypes and 222,538 “symptom” phenotypes, and provided mappings for 4,456 unique genes. Paths were validated by manual review of a diverse sample of paths. In a cohort of 92,455 samples, we used these mappings to validate gene-phenotype associations in 32,786 samples where a person had a potentially disease-causing genetic mutation and at least one corresponding diagnosis in their EHR.

Conclusion: The concepts and relationships in the UMLS can be used to generate gene-ICD phenotype mappings that are not explicit in the source vocabularies. We were able to use these mappings to validate gene-disease associations in a large cohort of sequenced exomes paired with EHR.
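
The traversal described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the node identifiers, the edge list, the `max_depth` cutoff and the function name `gene_to_icd_paths` are all hypothetical, and a plain breadth-first search over an in-memory dictionary stands in for whatever graph database and query logic the study actually used. It only shows the general idea of following directed relationships from a gene through variant and concept nodes until an ICD code is reached.

```python
from collections import deque

# Hypothetical toy graph. Node names and edges are placeholders for
# illustration only; they are not taken from the UMLS or from the paper.
EDGES = {
    "gene:GENE_A": ["variant:VAR_1", "concept:DISEASE_X"],
    "variant:VAR_1": ["concept:DISEASE_X"],
    "concept:DISEASE_X": ["icd:CODE_1", "concept:SYMPTOM_Y"],
    "concept:SYMPTOM_Y": ["icd:CODE_2"],
    "gene:GENE_B": ["concept:SYMPTOM_Y"],
}


def gene_to_icd_paths(edges, gene, max_depth=4):
    """Return {icd_node: shortest path from gene} using breadth-first search.

    max_depth (an assumed parameter) caps the number of edges followed,
    so that very long, weakly related chains are not reported.
    """
    paths = {}
    visited = {gene}
    queue = deque([[gene]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node.startswith("icd:"):
            # ICD codes are terminal: record the path and stop expanding.
            paths[node] = path
            continue
        if len(path) - 1 >= max_depth:
            # This path already uses max_depth edges; do not extend it.
            continue
        for neighbor in edges.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return paths


if __name__ == "__main__":
    for icd, path in gene_to_icd_paths(EDGES, "gene:GENE_A").items():
        print(icd, "<-", " -> ".join(path))
```

Keeping the full path for each mapping, rather than only the gene-ICD pair, is a design choice that would support the kind of manual path review mentioned in the Results.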
