UK-Biobank Whole Exome Sequence Binary Phenome Analysis with Robust Region-based Rare Variant Test

In biobank data analysis, most binary phenotypes have unbalanced case-control ratios, which can inflate type I error rates. Recently, a saddlepoint approximation (SPA)-based single-variant test has been developed to provide an accurate and scalable method for testing associations with such phenotypes. For gene- or region-based multiple-variant tests, a few methods exist that adjust for unbalanced case-control ratios; however, these methods are either less accurate when case-control ratios are extremely unbalanced or not scalable to large data analyses. To address these problems, we propose SKAT/SKAT-O type region-based tests in which the single-variant score statistic is calibrated using SPA and efficient resampling (ER). Through simulation studies, we show that the proposed method provides well-calibrated p-values. In contrast, the unadjusted approach has greatly inflated type I error rates (90 times the exome-wide α = 2.5 × 10⁻⁶) when the case-control ratio is 1:99. Additionally, the proposed method has computation time similar to that of the unadjusted approaches and is scalable to large-sample data. Our UK Biobank whole exome sequence data analysis of 45,596 unrelated European samples and 791 PheCode phenotypes identified 10 rare variant associations with p-value < 10⁻⁷, including the associations between JAK2 and myeloproliferative disease, TNC and large cell lymphoma, and F11 and congenital coagulation defects. All analysis summary results are publicly available through a web-based visual server.
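To illustrate why unbalanced case-control ratios distort a single-variant score test, the following is a minimal sketch of a saddlepoint tail approximation for the score statistic S = Σ gᵢ(Yᵢ − μᵢ) with Yᵢ ~ Bernoulli(μᵢ). It is not the region-based procedure proposed here, only the generic single-variant building block. The function names, the bisection solver, the `t_max` cap, and the toy data are assumptions made for illustration, and no continuity correction is applied.

```python
import math


def normal_sf(x):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def spa_upper_tail(g, mu, s, t_max=64.0):
    """Saddlepoint approximation to P(S >= s), S = sum_i g_i * (Y_i - mu_i).

    g  : genotype values, e.g. minor-allele counts (illustrative input)
    mu : null probabilities P(Y_i = 1)
    s  : observed score, assumed to lie strictly inside the support of S
    """
    def cgf(t):      # K(t): cumulant generating function of S
        return sum(math.log(1.0 - m + m * math.exp(gi * t)) - gi * m * t
                   for gi, m in zip(g, mu))

    def cgf_d1(t):   # K'(t), increasing in t
        return sum(gi * m * math.exp(gi * t) / (1.0 - m + m * math.exp(gi * t))
                   - gi * m for gi, m in zip(g, mu))

    def cgf_d2(t):   # K''(t)
        return sum(gi * gi * m * (1.0 - m) * math.exp(gi * t)
                   / (1.0 - m + m * math.exp(gi * t)) ** 2
                   for gi, m in zip(g, mu))

    # Bracket the saddlepoint, i.e. the root of K'(t) = s.
    lo, hi = -1.0, 1.0
    while cgf_d1(lo) > s:
        lo *= 2.0
        if lo < -t_max:
            raise ValueError("s is outside the range this sketch can handle")
    while cgf_d1(hi) < s:
        hi *= 2.0
        if hi > t_max:
            raise ValueError("s is outside the range this sketch can handle")

    # Plain bisection; slow but dependable for a monotone function.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    # Near the mean the saddlepoint formula is numerically unstable,
    # so fall back to the normal approximation there.
    if abs(t) < 1e-4:
        return normal_sf(s / math.sqrt(cgf_d2(0.0)))

    w = math.copysign(math.sqrt(2.0 * (t * s - cgf(t))), t)
    v = t * math.sqrt(cgf_d2(t))
    return normal_sf(w + math.log(v / w) / w)


# Toy data (assumed): 1:99 case-control ratio, 20 carriers, 3 of them cases.
g = [1] * 20 + [0] * 980
mu = [0.01] * 1000
s_obs = 3 - sum(gi * m for gi, m in zip(g, mu))

p_spa = spa_upper_tail(g, mu, s_obs)
var0 = sum(gi * gi * m * (1.0 - m) for gi, m in zip(g, mu))
p_normal = normal_sf(s_obs / math.sqrt(var0))
print(p_spa, p_normal)
```

In this toy setting the score is a shifted Binomial(20, 0.01) count, so the exact tail can be computed directly. The normal approximation understates it by several orders of magnitude, which is the mechanism behind the inflated type I error rates described above, while the saddlepoint value stays on the same order as the exact probability.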
