Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience

BACKGROUND Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and for more effective gene-based disease management. A key requirement for such studies is standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, and how adopting such common definitions in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis.
METHODS The authors mapped phenotype data dictionaries from five eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases, such as peripheral arterial disease and type 2 diabetes. Standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms), were used for the mapping. The mapping process comprised both lexical techniques (searching for relevant pre-coordinated concepts and data elements) and semantic techniques (post-coordination). Where feasible, new data elements were curated to improve coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies.
RESULTS Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts, before eMERGE investigators initiated any additional curation of terminology and metadata resources. After curating 54 new caDSR CDEs and nine new NCI Thesaurus concepts, and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements.
CONCLUSION This study emphasizes the need for standardized representation of clinical research data using existing metadata and terminology resources, and provides simple techniques and software for data element mapping, drawing on experiences from the eMERGE Network.
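
To make the lexical step concrete, the following is a minimal sketch of matching data element names against pre-coordinated concept labels. It is not the authors' tool: the token-overlap (Jaccard) score, the 0.6 threshold, the function names, and the sample labels are all assumptions chosen for illustration, and a real mapping would query caDSR or SNOMED CT rather than a hard-coded list.

```python
import re


def normalize(label):
    """Lowercase a label and return its set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a, b):
    """Token-set overlap between two labels, in the range 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def lexical_map(elements, concepts, threshold=0.6):
    """Map each data element to its best-scoring concept label.

    Elements whose best score falls below the (assumed) threshold map to
    None, marking them as candidates for curation or post-coordination.
    """
    mapping = {}
    for element in elements:
        tokens = normalize(element)
        best_label, best_score = None, 0.0
        for label in concepts:
            score = jaccard(tokens, normalize(label))
            if score > best_score:
                best_label, best_score = label, score
        mapping[element] = best_label if best_score >= threshold else None
    return mapping


# Hypothetical data dictionary entries and concept labels.
elements = [
    "Body Mass Index",
    "systolic blood pressure",
    "Ankle-brachial index at rest",
]
concepts = [
    "Body mass index",
    "Systolic blood pressure",
    "Diastolic blood pressure",
    "Type 2 diabetes mellitus",
]

mapping = lexical_map(elements, concepts)
# Fraction of elements that found a lexical match.
coverage = sum(target is not None for target in mapping.values()) / len(mapping)
```

In this toy run, two of the three elements match a pre-coordinated label and the third is left unmapped. That is the same kind of coverage figure the study reports (95 of 157 elements, about 60%), and the unmapped element corresponds to the cases the authors handled through curation and post-coordination.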
