A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction

While linear mixed models (LMMs) have shown competitive performance in correcting spurious associations caused by population stratification, family structure, and cryptic relatedness, challenges remain in handling the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can exploit such relatedness information in a heterogeneous dataset is crucial for genetic modeling. We propose the sparse graph-structured linear mixed model (sGLMM), which incorporates relatedness information among traits while correcting for confounding. Our method can uncover the genetic associations of a large number of phenotypes jointly while accounting for the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms existing approaches and can model correlation arising from both population structure and shared signals. Further, we validate the effectiveness of sGLMM on real-world genomic datasets from two species, a plant (Arabidopsis thaliana) and humans. On the Arabidopsis thaliana data, sGLMM performs better than all baseline models for 63.4% of the traits. We also discuss the potential causal genetic variation for human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.
