Cox regression increases power to detect genotype-phenotype associations in genomic studies using the electronic health record

Background: The growth of DNA biobanks linked to data from electronic health records (EHRs) has enabled the discovery of numerous associations between genomic variants and clinical phenotypes. Nonetheless, although clinical data are generally longitudinal, standard approaches for detecting genotype-phenotype associations in such linked data, notably logistic regression, do not naturally account for variation in the period of follow-up or the time at which an event occurs. Here we explored the advantages of quantifying associations using Cox proportional hazards regression, which can account for the age at which a patient first visited the healthcare system (left truncation) and the age at which a patient either last visited the healthcare system or acquired a particular phenotype (right censoring).

Results: In comprehensive simulations, we found that, compared to logistic regression, Cox regression had greater power at equivalent Type I error. We then scanned for genotype-phenotype associations using logistic regression and Cox regression on 50 phenotypes derived from the EHRs of 49,792 genotyped individuals. Consistent with the findings from our simulations, Cox regression had approximately 10% greater relative sensitivity for detecting known associations from the NHGRI-EBI GWAS Catalog. In terms of effect sizes, the hazard ratios estimated by Cox regression were strongly correlated with the odds ratios estimated by logistic regression.

Conclusions: As longitudinal health-related data continue to grow, Cox regression may improve our ability to identify the genetic basis for a wide range of human phenotypes.
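To make the roles of left truncation and right censoring concrete, the following is a minimal sketch of a single-covariate Cox fit on the age scale. It is an illustration only, not the authors' pipeline: the function name, the toy cohort, and the choice of a plain Newton-Raphson solver are all assumptions introduced here. The key step is the risk set, in which a patient counts as at risk at age t only after their first visit and up to their diagnosis or last visit.

```python
import math

def fit_cox_left_truncated(entry, exit_, event, x, max_iter=50, tol=1e-10):
    """Estimate the log hazard ratio for one covariate by Newton-Raphson.

    entry : age at first visit (left truncation)
    exit_ : age at diagnosis or at last visit
    event : 1 if diagnosed at exit_, 0 if right-censored
    x     : covariate, e.g. a genotype coded as an allele count
    Hypothetical helper for illustration; assumes no tied event ages.
    """
    beta = 0.0
    n = len(x)
    for _ in range(max_iter):
        score = 0.0  # first derivative of the partial log-likelihood
        info = 0.0   # negative second derivative
        for i in range(n):
            if not event[i]:
                continue  # censored patients only contribute via risk sets
            t = exit_[i]
            s0 = s1 = s2 = 0.0
            for j in range(n):
                # At risk at age t: already under observation, not yet exited.
                if entry[j] < t <= exit_[j]:
                    w = math.exp(beta * x[j])
                    s0 += w
                    s1 += w * x[j]
                    s2 += w * x[j] * x[j]
            mean = s1 / s0
            score += x[i] - mean
            info += s2 / s0 - mean * mean
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy cohort (made-up ages, not data from the study).
entry_age = [30, 25, 40, 35, 20, 50, 45, 30]
exit_age = [45, 50, 55, 60, 65, 70, 72, 75]
had_event = [1, 1, 1, 0, 1, 0, 1, 0]
genotype = [1, 1, 0, 0, 0, 1, 1, 0]  # carrier status of a hypothetical variant

beta_hat = fit_cox_left_truncated(entry_age, exit_age, had_event, genotype)
print("estimated hazard ratio:", round(math.exp(beta_hat), 3))
```

A logistic regression on the same toy cohort would see only the final case/control labels and genotypes; the ages at entry and exit, which determine who is comparable to whom at each event, would go unused. In practice an established survival library would replace this hand-written solver.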
