SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based GxE Tests in Biobank Data

The explosion of biobank data offers immediate opportunities for gene-environment (G×E) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are widely used to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., a gene). In this work, focusing on continuous traits, we present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based G×E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original G×E VC tests; it neither imposes additional assumptions nor relies on approximations. SEAGLE can easily accommodate sample sizes on the order of 10^5, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting a genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction between genes and physical activity status on body mass index.
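To make the idea of an "exact" yet scalable set-based test concrete, the following is a minimal sketch and not the SEAGLE implementation. It assumes a simplified null model in which the genetic main effects are ordinary fixed effects fit by least squares, whereas the tests described above are variance component tests. The function name `gxe_score_stat`, the simulated data, and all parameter values are hypothetical. The sketch computes a score-type statistic for the G×E effect, together with the eigenvalues that weight its null chi-square mixture, using only n×p and L×L matrices instead of an n×n projection matrix.

```python
import numpy as np

def gxe_score_stat(y, X, E, G):
    """Score-type statistic for H0: no GxE effect, plus mixture eigenvalues.

    Simplifying assumption (not from the source): genetic main effects are
    fixed effects, and the null design matrix has full column rank.
    """
    n = y.shape[0]
    S = E[:, None] * G                            # n x L interaction design, diag(E) @ G
    W = np.column_stack([np.ones(n), X, E, G])    # null design: intercept, covariates, E, G
    Qw, _ = np.linalg.qr(W)                       # orthonormal basis of col(W), n x p
    r = y - Qw @ (Qw.T @ y)                       # null-model residuals, no n x n projector
    sigma2 = r @ r / (n - Qw.shape[1])            # residual variance estimate
    u = S.T @ r                                   # L-vector of score contributions
    stat = u @ u / sigma2
    B = Qw.T @ S                                  # p x L
    M = S.T @ S - B.T @ B                         # L x L, equals S^T (I - H) S
    lam = np.linalg.eigvalsh(M)                   # weights of the null chi-square mixture
    return stat, lam

# Hypothetical simulated data generated under the null (no GxE effect).
rng = np.random.default_rng(0)
n, L = 500, 5
G = rng.binomial(2, 0.3, size=(n, L)).astype(float)   # genotypes coded 0/1/2
E = rng.normal(size=n)                                 # environmental factor
X = rng.normal(size=(n, 2))                            # other covariates
y = X @ np.array([0.5, -0.2]) + 0.3 * E + G @ np.full(L, 0.1) + rng.normal(size=n)

stat, lam = gxe_score_stat(y, X, E, G)
print(stat, lam)
```

The statistic and eigenvalues agree with those from the explicit n×n projection up to floating-point rounding, which is the sense in which a reformulation of this kind is exact and not approximate. A p-value would follow by referring `stat` to the weighted chi-square mixture defined by `lam`.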
