Feasibility of Using Clinical Element Models (CEM) to Standardize Phenotype Variables in the Database of Genotypes and Phenotypes (dbGaP)

The Database of Genotypes and Phenotypes (dbGaP) contains various types of data generated in genome-wide association studies (GWAS). These data can facilitate novel scientific discovery and reduce the cost and time of exploratory research. However, idiosyncrasies in variable names are a major barrier to reusing these data. We studied the problem of formalizing phenotype variable descriptions using Clinical Element Models (CEMs). Direct mapping of 379 phenotype names to existing CEMs yielded a low rate of exact matches (N=25). However, the flexible and expressive information models underlying CEMs provided a robust means of representing 115 phenotype variable descriptions, indicating that CEMs can be successfully applied to standardize a large portion of the clinical variables contained in dbGaP.