A Lasso multi-marker mixed model for association mapping with population structure correction

MOTIVATION: Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In traits with simple Mendelian architectures, single polymorphic loci explain a significant fraction of the phenotypic variability. However, many traits of interest seem to be subject to multifactorial control by groups of genetic loci. Accurate detection of such multivariate associations is non-trivial and often compromised by limited statistical power. At the same time, confounding influences, such as population structure, cause spurious association signals that result in false-positive findings.

RESULTS: We propose LMM-Lasso, a linear mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters; it effectively controls for population structure and scales to genome-wide datasets. LMM-Lasso simultaneously discovers likely causal variants and allows for multi-marker-based phenotype prediction from genotype. We demonstrate the practical use of LMM-Lasso in genome-wide association studies in Arabidopsis thaliana and in linkage mapping in mouse, where our method achieves significantly more accurate phenotype prediction for 91% of the considered phenotypes. At the same time, our model dissects the phenotypic variability into components that result from individual single nucleotide polymorphism effects and from population structure. Enrichment of known candidate genes suggests that the individual associations retrieved by LMM-Lasso are likely to be genuine.

AVAILABILITY: Code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/research.html.

CONTACT: rakitsch@tuebingen.mpg.de, ippert@microsoft.com or stegle@ebi.ac.uk

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
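To illustrate how multi-locus mapping and correction for population structure can be combined, the following is a minimal sketch and not the released LMM-Lasso code. It assumes the common formulation in which the phenotype covariance is proportional to K + delta*I for a kinship matrix K, so that rotating the data by the eigenvectors of K and rescaling decorrelates the samples, after which a plain Lasso can be fitted. The function names (`whiten`, `lasso_cd`), the fixed values of `delta` and `lam`, and the simulated data are all assumptions made for illustration; in practice these quantities would be estimated from the data rather than set by hand.

```python
import numpy as np

def whiten(X, y, K, delta):
    """Rotate and rescale X and y so that a covariance K + delta*I becomes the identity.

    K is assumed symmetric positive semi-definite and delta > 0.
    """
    S, U = np.linalg.eigh(K)                 # K = U diag(S) U^T
    scale = 1.0 / np.sqrt(S + delta)         # inverse square root of rotated covariance
    Xw = (U.T @ X) * scale[:, None]
    yw = (U.T @ y) * scale
    return Xw, yw

def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                      # skip all-zero columns
            resid += X[:, j] * beta[j]        # remove marker j from the current fit
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated marker j back
    return beta

# Hypothetical toy data: 60 individuals, 30 markers, two causal markers.
rng = np.random.default_rng(0)
n, p = 60, 30
X = rng.standard_normal((n, p))
K = X @ X.T / p                               # simple realized-relationship kinship
beta_true = np.zeros(p)
beta_true[[3, 17]] = [1.0, -1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

Xw, yw = whiten(X, y, K, delta=1.0)           # delta fixed by hand (assumption)
beta_hat = lasso_cd(Xw, yw, lam=0.5)          # lam fixed by hand (assumption)
selected = np.flatnonzero(beta_hat)           # indices of markers with non-zero weight
```

The markers in `selected` play the role of candidate associations, while the part of the phenotype absorbed by K corresponds to the population-structure component.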
