Efficient inference for genetic association studies with multiple outcomes

Summary

Combined inference for heterogeneous high‐dimensional data is critical in modern biology, where clinical and various kinds of molecular data may be available from a single study. Classical genetic association studies regress a single clinical outcome on many genetic variants one by one, but there is an increasing demand for joint analysis of many molecular outcomes and genetic variants in order to unravel functional interactions. Unfortunately, most existing approaches to joint modeling are either too simplistic to be powerful or are impracticable for computational reasons. Inspired by Richardson and others (2010, Bayesian Statistics 9), we consider a sparse multivariate regression model that allows simultaneous selection of predictors and associated responses. As Markov chain Monte Carlo (MCMC) inference on such models can be prohibitively slow when the number of genetic variants exceeds a few thousand, we propose a variational inference approach which produces posterior information very close to that of MCMC inference, at a much reduced computational cost. Extensive numerical experiments show that our approach outperforms popular variable selection methods and tailored Bayesian procedures, dealing within hours with problems involving hundreds of thousands of genetic variants and tens to hundreds of clinical or molecular outcomes.
