Multiplex Confounding Factor Correction for Genomic Association Mapping with Squared Sparse Linear Mixed Model

Genome-wide association studies (GWAS) offer a promising way to understand the association between human genomes and complex traits. Many simple polymorphic loci have been shown to explain a significant fraction of phenotypic variability. However, it remains difficult to explain complex traits that are associated with multifactorial genetic loci, especially in the presence of confounding factors caused by population structure, family structure, and cryptic relatedness. In this paper, we propose a Squared-LMM (LMM2) model that aims to jointly correct for population and genetic confounding factors. We offer two strategies for using LMM2 in association mapping: (1) as an extension of the univariate LMM, which effectively corrects for population structure but considers each SNP in isolation; and (2) integrated with a multivariate regression model to discover associations between complex traits and multifactorial genetic loci. We refer to this second model as sparse Squared-LMM (sLMM2). Further, we extend LMM2/sLMM2 by raising the power of the squared model, which yields the LMMn/sLMMn models. We demonstrate the practical use of our model with synthetic phenotypic variants generated from genetic loci of Arabidopsis thaliana. The experiments show that our method predicts the associations between traits and loci more accurately and with greater significance. We also evaluate our models on collected phenotypes and genotypes, using the number of candidate genes that the models discover as the measure. The results suggest that our method is promising for use in genome-wide association studies.
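
For orientation, the univariate LMM that the first strategy extends can be sketched as follows. This is a minimal illustration of a standard kinship-corrected single-SNP scan, not the LMM2 model itself, whose formulation is not given in this abstract. The function name `lmm_scan`, the fixed variance ratio `delta`, the omission of an intercept, and the simulated data are all assumptions made for the example; in practice `delta` would be estimated, for instance by maximum likelihood.

```python
import numpy as np

def lmm_scan(X, y, K, delta=1.0):
    """Single-SNP scan assuming y ~ N(x_j * b_j, sg2 * (K + delta * I)).

    X: (n, p) standardized genotypes, y: (n,) centered phenotype,
    K: (n, n) kinship matrix, delta: assumed noise-to-genetic variance ratio.
    Returns per-SNP effect estimates and t-statistics.
    """
    n = len(y)
    # Eigendecompose the kinship matrix: K = U diag(s) U^T
    s, U = np.linalg.eigh(K)
    # Rotate and rescale so the transformed residuals are uncorrelated
    w = 1.0 / np.sqrt(np.clip(s, 0.0, None) + delta)
    Xr = (U.T @ X) * w[:, None]
    yr = (U.T @ y) * w
    # Ordinary least squares, one SNP at a time (no intercept, for brevity)
    xx = np.sum(Xr * Xr, axis=0)
    beta = (Xr.T @ yr) / xx
    resid = yr[:, None] - Xr * beta
    sigma2 = np.sum(resid ** 2, axis=0) / (n - 1)
    t = beta / np.sqrt(sigma2 / xx)
    return beta, t

# Hypothetical simulated data: one causal SNP plus a polygenic confounder.
rng = np.random.default_rng(0)
n, p, causal = 200, 100, 7
G = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 genotype codes
X = (G - G.mean(axis=0)) / G.std(axis=0)           # standardize each SNP
K = X @ X.T / p                                    # realized kinship matrix
u = X @ rng.normal(size=p) / np.sqrt(p)            # confounding effect ~ N(0, K)
y = 2.0 * X[:, causal] + u + rng.normal(size=n)
y = y - y.mean()

beta, t = lmm_scan(X, y, K, delta=1.0)
top = int(np.argmax(np.abs(t)))                    # index of the strongest signal
```

Because each SNP is tested separately, a scan of this kind cannot model several loci acting together, which is the limitation the second, multivariate strategy is meant to address.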
