SAIGEgds - an efficient statistical tool for large-scale PheWAS with mixed models

SUMMARY Phenome-wide association studies (PheWAS) are a powerful tool for discovery and replication in genetic association studies. To reduce the computational burden of PheWAS in large cohorts such as the UK Biobank, the SAIGE method was proposed to control for case-control imbalance and sample relatedness in a tractable manner. However, SAIGE remains computationally intensive when used to analyze the associations of thousands of ICD10-coded phenotypes with whole-genome imputed genotype data. Here we present SAIGEgds, a new high-performance statistical R package for large-scale PheWAS using generalized linear mixed models. The package implements the SAIGE method in optimized C++ code, taking advantage of sparse genotype dosages and integrating the efficient genomic data structure (GDS) file format. Benchmarks using the UK Biobank White British genotype data (N ≈ 430K) with coronary heart disease and simulated cases show that the implementation in SAIGEgds is 5 to 6 times faster than the SAIGE R package. When used in conjunction with high-performance computing clusters, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

AVAILABILITY AND IMPLEMENTATION https://bioconductor.org/packages/SAIGEgds; vignettes included.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
