Accurate Genomic Prediction of Human Height

Hsu et al. used advanced methods from machine learning to analyze almost half a million genomes. They produced, for the first time, accurate genomic predictors for complex traits such as height, bone density, and educational attainment...

We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high-dimensional statistics (i.e., machine learning). The constructed predictors explain approximately 40%, 20%, and 9% of total variance for the three traits, respectively, in data not used for training. For example, predicted heights correlate ∼0.65 with actual heights, and the actual heights of most individuals in the validation samples are within a few centimeters of the prediction. The proportion of variance explained for height is comparable to the common SNP heritability estimated by genome-wide complex trait analysis (GCTA), and appears to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability attributable to common SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprising almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier genome-wide association studies (GWAS) for out-of-sample validation of our results.
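As a rough illustration of how a sparse genomic predictor of this kind could be built and scored out of sample, the sketch below fits an L1-penalized (LASSO) linear model to simulated genotypes and reports the validation correlation and its square. The abstract does not specify the algorithm, so the choice of LASSO, the simulated data, the penalty value `lam`, and every function name here are assumptions made for illustration only, not the authors' pipeline.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()                  # residual = y - X @ beta (beta starts at 0)
    col_sq = (X ** 2).sum(axis=0) / n    # (1/n) * x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (j's own effect added back).
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                residual -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta


def simulate_and_fit(n=600, p=300, n_causal=15, h2=0.5, lam=0.1, seed=0):
    """Simulate SNP genotypes and a heritable trait, fit on 400 people, validate on the rest.

    All parameter values are hypothetical and chosen only to keep the example small.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=p)                       # minor allele frequencies
    G = rng.binomial(2, maf, size=(n, p)).astype(float)       # genotypes coded 0/1/2

    n_train = 400
    mean, std = G[:n_train].mean(axis=0), G[:n_train].std(axis=0)
    X = (G - mean) / std                                      # standardize with training stats

    beta_true = np.zeros(p)
    causal = rng.choice(p, size=n_causal, replace=False)
    beta_true[causal] = rng.normal(size=n_causal)
    g = X @ beta_true
    g *= np.sqrt(h2 / g.var())                                # genetic variance = h2
    y = g + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)        # environmental noise

    X_tr, X_val = X[:n_train], X[n_train:]
    y_mean = y[:n_train].mean()
    beta = lasso_coordinate_descent(X_tr, y[:n_train] - y_mean, lam)

    pred = X_val @ beta + y_mean
    corr = np.corrcoef(pred, y[n_train:])[0, 1]               # out-of-sample correlation
    return {
        "n_active": int(np.count_nonzero(beta)),              # SNPs with nonzero weight
        "corr": float(corr),
        "r2_from_corr": float(corr ** 2),                     # squared correlation
    }


result = simulate_and_fit()
print(result)
```

The same squared-correlation arithmetic connects the two figures quoted for height: a correlation of about 0.65 squares to about 0.42, consistent with roughly 40% of variance explained.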
