A System for Phenotype Harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program

Genotype-phenotype association studies often combine phenotype data from multiple studies to increase power. Harmonization of the data usually requires substantial effort due to heterogeneity in phenotype definitions, study design, data collection procedures, and data set organization. Here we describe a centralized system for phenotype harmonization that includes input from phenotype domain and study experts, quality control, documentation, reproducible results, and data sharing mechanisms. This system was developed for the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program, which is generating genomic and other omics data for more than 80 studies with extensive phenotype data. To date, 63 phenotypes have been harmonized across thousands of participants from up to 17 TOPMed studies per phenotype. We discuss the challenges faced in this undertaking and how they were addressed. The harmonized phenotype data and associated documentation have been submitted to National Institutes of Health data repositories for controlled access by the scientific community. We also provide materials to facilitate future harmonization efforts by the community, which include (1) the code used to generate the 63 harmonized phenotypes, enabling others to reproduce, modify, or extend these harmonizations to additional studies; and (2) results of labeling thousands of phenotype variables with controlled vocabulary terms.

Brian E. Cade | Jan Graffelman | Alex P. Reiner | Ming-Huei Chen | Braxton D. Mitchell | Mariza de Andrade | Shannon Kelly | Lawrence F. Bielak | Joshua C. Bis | Andrew D. Johnson | Rasika A. Mathias | Patricia A. Peyser | Jerome I. Rotter | Jennifer A. Smith | Lisa R. Yanek | L. Adrienne Cupples | Cathy C. Laurie | Pradeep Natarajan | Fei Fei Wang | Kathleen C. Barnes | Patrick T. Ellinor | Xiuqing Guo | Tanika N. Kelly | Charles Kooperberg | May E. Montasser | Gina M. Peloso | Daniel E. Weeks | Scott T. Weiss | Adolfo Correa | Kent D. Taylor | Stephen S. Rich | Nathan Pankratz | Myriam Fornage | Adrienne M. Stilp | Leslie S. Emery | Jai G. Broome | Erin J. Buth | Alyna T. Khan | Cecelia A. Laurie | Quenna Wong | Dongquan Chen | Catherine M. D’Augustine | Nancy L. Heard-Costa | Chancellor R. Hohensee | William Craig Johnson | Lucia D. Juarez | Jingmin Liu | Karen M. Mutalik | Laura M. Raffield | Kerri L. Wiggins | Paul S. de Vries | Donna K. Arnett | Stella Aslibekyan | Nora Franceschini | Weiniu Gan | Santhi K. Ganesh | Megan L. Grove | Nicola L. Hawley | Wan-Ling Hsu | Rebecca D. Jackson | Cashell E. Jaquish | Sharon LR Kardia | Jiwon Lee | Stephen T. McGarvey | Alanna C. Morrison | Kari E. North | Seyed Mehdi Nouraie | Elizabeth C. Oelsner | Ramachandran S. Vasan | Carla G. Wilson | Bruce M. Psaty | Susan R. Heckbert
