Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies

Background: Data from genome-wide association studies (GWASs) have been used in recent years to estimate the heritability of human complex traits. Existing methods are based on the linear mixed model and assume that the genetic effects are random variables, which contradicts the fixed-effect assumption embedded in quantitative genetics theory. Moreover, the heritability estimators provided by existing methods may have large standard errors, which calls for the development of reliable and accurate estimation methods.

Results: In this paper, we first investigate how the fixed-effect and random-effect assumptions influence heritability estimation, and prove theoretically that the two assumptions are equivalent under mild conditions. Second, we propose a two-stage strategy: we first perform sparse regularization via cross-validated elastic net, and then apply variance estimation methods to construct reliable heritability estimates. Results on both simulated and real data show that our strategy achieves a considerable reduction in standard error while preserving accuracy.

Conclusions: The proposed strategy allows for reliable and accurate heritability estimation using GWAS data. It suggests that reliable estimates can still be obtained even with a relatively restricted sample size, and it should be especially useful for large-scale heritability analyses in the genomics era.
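The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn's `ElasticNetCV` for the sparse regularization stage and, as a stand-in for the unspecified "variance estimation methods", an ordinary least-squares refit on a held-out half of the sample. The function names `simulate` and `two_stage_heritability`, the half/half split, the cap on the number of selected SNPs, and all simulation settings are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV


def simulate(n=400, p=500, n_causal=10, h2=0.5, seed=1):
    """Simulate standardized genotypes and a trait with heritability h2 (toy data)."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, size=p)            # minor allele frequencies
    G = rng.binomial(2, maf, size=(n, p)).astype(float)
    X = (G - G.mean(axis=0)) / G.std(axis=0)        # column-standardize genotypes
    beta = np.zeros(p)
    causal = rng.choice(p, size=n_causal, replace=False)
    beta[causal] = rng.normal(size=n_causal)
    g = X @ beta
    g = g * np.sqrt(h2 / g.var())                   # genetic variance = h2
    y = g + rng.normal(scale=np.sqrt(1.0 - h2), size=n)
    return X, y


def two_stage_heritability(X, y, l1_ratio=0.5, cv=5, seed=0):
    """Stage 1: elastic-net selection. Stage 2: residual-variance estimate."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    sel_idx, est_idx = idx[: n // 2], idx[n // 2:]

    # Stage 1: cross-validated elastic net on the first half selects SNPs.
    enet = ElasticNetCV(l1_ratio=l1_ratio, cv=cv, random_state=seed)
    enet.fit(X[sel_idx], y[sel_idx])
    selected = np.flatnonzero(enet.coef_)

    # Assumption: keep at most half as many SNPs as estimation samples,
    # so that the least-squares refit below stays well-posed.
    max_k = len(est_idx) // 2
    if selected.size > max_k:
        order = np.argsort(-np.abs(enet.coef_[selected]))
        selected = selected[order[:max_k]]

    # Stage 2: refit by least squares on the second half and estimate
    # the residual variance from the refit.
    design = np.column_stack([np.ones(len(est_idx)), X[est_idx][:, selected]])
    ye = y[est_idx]
    coef, *_ = np.linalg.lstsq(design, ye, rcond=None)
    resid = ye - design @ coef
    sigma2 = resid @ resid / (len(est_idx) - design.shape[1])

    # Heritability as the share of trait variance not attributed to noise.
    h2_hat = 1.0 - sigma2 / np.var(y, ddof=1)
    return float(np.clip(h2_hat, 0.0, 1.0)), selected


X, y = simulate()
h2_hat, selected = two_stage_heritability(X, y)
print(f"estimated h2 = {h2_hat:.3f} using {selected.size} selected SNPs")
```

Selecting SNPs on one half and estimating the variance on the other is one simple way to avoid reusing the same data for selection and estimation; the paper itself may pair the selection step with different variance estimators.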
