Limitations of GCTA as a solution to the missing heritability problem

Significance

The genetic contribution to a phenotype is frequently measured by heritability, the fraction of trait variation explained by genetic differences. Hundreds of publications have found DNA polymorphisms that are statistically associated with diseases or quantitative traits [genome-wide association studies (GWASs)]. Genome-wide complex trait analysis (GCTA), a recent method of analyzing such data, finds high heritabilities for such phenotypes. We analyze GCTA and show that the heritability estimates it produces are highly sensitive to the structure of the genetic relatedness matrix, to the sampling of phenotypes and subjects, and to the accuracy of phenotype measurements. Plausible modifications of the method aimed at increasing stability yield much smaller heritabilities. It is essential to reevaluate the many published heritability estimates based on GCTA.

Genome-wide association studies (GWASs) seek to understand the relationship between complex phenotypes (e.g., height) and up to millions of single-nucleotide polymorphisms (SNPs). Early analyses of GWASs are commonly believed to have “missed” much of the additive genetic variance estimated from correlations between relatives. A more recent method, genome-wide complex trait analysis (GCTA), obtains much higher estimates of heritability using a model of random SNP effects correlated between genotypically similar individuals. GCTA has now been applied to many phenotypes, from schizophrenia to scholastic achievement. However, recent studies question GCTA’s estimates of heritability. Here, we show that GCTA applied to current SNP data cannot produce reliable or stable estimates of heritability. We show first that GCTA depends sensitively on all singular values of a high-dimensional genetic relatedness matrix (GRM). When the assumptions in GCTA are satisfied exactly, we show that the heritability estimates produced by GCTA will be biased and the standard errors will likely be inaccurate. When the population is stratified, we find that GRMs typically have highly skewed singular values, and we prove that the many small singular values cannot be estimated reliably. Hence, GWAS data are necessarily overfit by GCTA, which, as a result, produces high estimates of heritability. We also show that GCTA’s heritability estimates are sensitive to the chosen sample and to measurement errors in the phenotype. We illustrate our results using the Framingham dataset. Our analysis suggests that results obtained using GCTA, and qualitative interpretations of those results, should be treated with great caution.
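The claim about GRM spectra under stratification can be explored with a small simulation. The sketch below is only an illustration and is not the authors' analysis: it uses simulated genotypes rather than the Framingham data, and the function names (`simulate_genotypes`, `grm`, `spectrum`), the two-subpopulation Balding-Nichols model, and all parameter values are assumptions chosen for the example. It builds a GRM as Z Zᵀ / p from column-standardized genotypes and returns its eigenvalues, which for a symmetric positive semidefinite matrix coincide with its singular values.

```python
import numpy as np


def simulate_genotypes(n, p, rng, fst=0.0):
    """Simulate an n x p matrix of 0/1/2 genotype counts.

    Assumption for illustration: with fst > 0 the sample is split into two
    equal subpopulations whose allele frequencies are drawn from a
    Balding-Nichols (beta) model around shared ancestral frequencies.
    """
    anc = rng.uniform(0.1, 0.9, size=p)  # ancestral allele frequencies
    if fst == 0.0:
        freqs = np.tile(anc, (n, 1))  # one homogeneous population
    else:
        a = anc * (1.0 - fst) / fst
        b = (1.0 - anc) * (1.0 - fst) / fst
        pop_freqs = rng.beta(a, b, size=(2, p))  # one row per subpopulation
        labels = np.repeat([0, 1], [n // 2, n - n // 2])
        freqs = pop_freqs[labels]
    return rng.binomial(2, freqs).astype(float)


def grm(genotypes):
    """Genetic relatedness matrix A = Z Z^T / p, Z = standardized genotypes."""
    sd = genotypes.std(axis=0)
    keep = sd > 0  # drop monomorphic SNPs, which cannot be standardized
    g = genotypes[:, keep]
    z = (g - g.mean(axis=0)) / sd[keep]
    return z @ z.T / z.shape[1]


def spectrum(a):
    """Eigenvalues of a symmetric matrix, sorted from largest to smallest."""
    return np.linalg.eigvalsh(a)[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 2000  # hypothetical sample size and SNP count
    for name, fst in [("unstratified", 0.0), ("stratified", 0.05)]:
        ev = spectrum(grm(simulate_genotypes(n, p, rng, fst=fst)))
        print(name, "largest:", ev[0], "median:", np.median(ev))
```

Comparing the largest eigenvalue with the median for the two settings gives a rough sense of how population structure skews the spectrum. The simulation says nothing by itself about how reliably the small eigenvalues can be estimated, which is the subject of the paper's formal results.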
