A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank.

With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, time-to-event data analysis has attracted increasing attention in genetic studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most widely used approaches. However, existing methods and tools do not scale to a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable to genome-wide time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than existing alternatives such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction, and that it controls type I error rates at the genome-wide significance level regardless of minor allele frequency. Through the analysis of UK Biobank inpatient data from 282,871 samples of white British European ancestry, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
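To illustrate what calibrating a test statistic with a saddlepoint approximation can look like, the following is a minimal, generic sketch and not the SPACox implementation. It assumes the statistic is a weighted sum S = sum_i w_i R_i, where the R_i are treated as independent draws from the empirical distribution of a fixed set of residuals and the weights w_i stand in for genotypes. The function names (`cgf`, `spa_upper_tail`) are hypothetical, and the Barndorff-Nielsen tail formula is one common choice that the abstract does not confirm.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _tilted_moments(t, weight, residuals):
    """Log-MGF, mean and variance of weight*R under exponential tilting by t."""
    exponents = [t * weight * r for r in residuals]
    shift = max(exponents)  # log-sum-exp shift keeps exp() from overflowing
    terms = [math.exp(e - shift) for e in exponents]
    total = sum(terms)
    mean = sum(weight * r * p for r, p in zip(residuals, terms)) / total
    second = sum((weight * r) ** 2 * p for r, p in zip(residuals, terms)) / total
    log_mgf = shift + math.log(total / len(residuals))
    return log_mgf, mean, second - mean * mean


def cgf(t, weights, residuals):
    """Return K(t), K'(t), K''(t) for S = sum_i weights[i] * R_i (assumed model)."""
    parts = [_tilted_moments(t, w, residuals) for w in weights]
    return tuple(sum(p[k] for p in parts) for k in range(3))


def spa_upper_tail(s, weights, residuals):
    """Approximate P(S >= s) with the Barndorff-Nielsen saddlepoint formula."""
    _, mean, var = cgf(0.0, weights, residuals)
    if var <= 0.0:
        raise ValueError("S has zero variance under the assumed model")
    z = (s - mean) / math.sqrt(var)
    if abs(z) < 1e-3:
        # The saddlepoint formula is singular at the mean; use the normal limit.
        return 1.0 - _STD_NORMAL.cdf(z)

    # K'(t) = s only has a root when s lies strictly inside the support of S.
    s_max = sum(max(w * r for r in residuals) for w in weights)
    s_min = sum(min(w * r for r in residuals) for w in weights)
    if not s_min < s < s_max:
        raise ValueError("s lies outside the support of S")

    # K'(t) is increasing in t, so bracket the root and then bisect.
    lo, hi = -1.0, 1.0
    while cgf(lo, weights, residuals)[1] > s:
        lo *= 2.0
    while cgf(hi, weights, residuals)[1] < s:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cgf(mid, weights, residuals)[1] < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)

    k, _, k2 = cgf(t_hat, weights, residuals)
    w = math.copysign(math.sqrt(max(2.0 * (t_hat * s - k), 0.0)), t_hat)
    v = t_hat * math.sqrt(k2)
    w_star = w + math.log(v / w) / w
    return _STD_NORMAL.cdf(-w_star)
```

In a genome-wide setting, the residuals would come from the single null model fit and only the weights would change from variant to variant, which is the property the abstract credits for the speed-up. A call such as `spa_upper_tail(9.0, [1.0] * 20, [-1.0, 1.0])` approximates the upper tail of a sum of 20 independent ±1 variables.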
