Improved linear mixed models for genome-wide association studies

to determine these similarities¹. Here, however, we show theoretically and experimentally that carefully selecting a small number of SNPs systematically increases power (that is, it jointly reduces false positives and false negatives), improves calibration (lessens inflation or deflation of the test statistic) and reduces computational cost.

Our approach is motivated by two considerations. First, an LMM with no fixed effects using genetic similarities constructed from a set of SNPs is mathematically equivalent to a linear regression of the phenotype on those SNPs, with the regression weights integrated over independent normal distributions that share the same variance (namely, the genetic variance)³. That is, an LMM using a given set of SNPs for genetic similarity is equivalent to (Bayesian) linear regression using those SNPs as covariates to correct for confounding. In theory, this equivalence holds only for certain forms of genetic similarity matrices, such as the realized relationship matrix²,³. In practice, however, the realized relationship matrix and other measures of similarity, such as identity by state¹, yield very similar measures of association (Supplementary Note 1), and thus our demonstration is quite general.

Second, regardless of the form of regression used for GWAS, the significance of a SNP-phenotype association should be determined by conditioning on exactly those SNPs that are associated with the phenotype. These SNPs include causal SNPs, or nearby SNPs that tag causal SNPs, and SNPs that are associated by way of confounding (for example, because of population structure). By conditioning on causal or tagging SNPs, we reduce the noise in the assessment of the association⁴. By conditioning on SNPs associated because of confounding, we control for such confounding⁵. Moreover, if a SNP is unrelated to the phenotype, it should not be in the conditioning set.
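The equivalence behind the first consideration can be checked numerically. The following is a minimal sketch rather than the authors' implementation. It assumes a realized-relationship-style similarity K = X Xᵀ / s built from s SNP columns (taken here to be already standardized), no fixed effects, and known variance components; the function names and the symbols sigma_g2 and sigma_e2 are illustrative choices, not names from the paper.

```python
import numpy as np


def loglik_kernel_form(y, X, sigma_g2, sigma_e2):
    """LMM view: y ~ N(0, sigma_g2 * K + sigma_e2 * I) with K = X X^T / s."""
    n, s = X.shape
    K = X @ X.T / s                                  # n x n genetic similarity
    V = sigma_g2 * K + sigma_e2 * np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


def loglik_regression_form(y, X, sigma_g2, sigma_e2):
    """Regression view: y = X w + e, w ~ N(0, (sigma_g2 / s) I), w integrated out.

    Only s x s matrices are factorized (Woodbury identity and matrix
    determinant lemma), which is cheap when s is smaller than n.
    """
    n, s = X.shape
    tau2 = sigma_g2 / s                              # prior variance of each weight
    A = np.eye(s) + (tau2 / sigma_e2) * (X.T @ X)    # s x s
    _, logdet_A = np.linalg.slogdet(A)
    logdet = n * np.log(sigma_e2) + logdet_A
    Xty = X.T @ y
    quad = (y @ y) / sigma_e2 - (tau2 / sigma_e2 ** 2) * (Xty @ np.linalg.solve(A, Xty))
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))               # 200 individuals, 20 SNPs (simulated)
    y = rng.standard_normal(200)
    print(loglik_kernel_form(y, X, 0.6, 0.4))
    print(loglik_regression_form(y, X, 0.6, 0.4))    # agrees up to rounding error
```

The two functions evaluate the same marginal likelihood; the second form also hints at why a similarity matrix built from fewer SNPs than individuals can be handled cheaply.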
In the particular case in which we use Bayesian linear regression for GWAS, the inclusion of unrelated SNPs in the genetic similarity matrix decreases the relative influence of each SNP on the phenotype, because all SNP weights share the same prior distribution, whose variance (the genetic variance in the LMM view) is estimated from the data. The decrease in influence leads to incomplete correction for confounding and hence to inflated test statistics and reduced power. We refer to this phenomenon as ‘dilution’.

To identify SNPs that satisfy these principles, we developed a simple heuristic that yields improved power and calibration. First, we order SNPs by their linear-regression P values from lowest to highest. Then we construct genetic similarity matrices with an increasing number of SNPs, taken in that order, until we find the first minimum in λGC (the genomic control factor). In practice, the number of SNPs selected is typically smaller than the number of individuals analyzed, a condition that can be exploited by an existing algorithm, FaST-LMM, to yield large computational savings².

The equivalence between the LMM and Bayesian linear regression also implies that, when a given SNP is being tested, that SNP should be excluded from the computation of genetic similarity to avoid using it as a covariate. Including the SNP would make the log likelihood of the null model higher than it should be and lead to deflation of the test statistic and loss of power. We call this phenomenon ‘proximal contamination’. In addition to the SNP being tested, we also exclude SNPs in close proximity to it (for example, within 2 centimorgans), as linkage disequilibrium will lead to a similar deflation and loss of power. A naive algorithm for excluding these SNPs from the similarity matrix is computationally expensive, so we developed a speedup (Supplementary Note 2). Together, the linear-regression scan to select SNPs for inclusion in the matrix and Supplementary Table 4).
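The selection heuristic and the exclusion of nearby SNPs described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. SNPs are ranked by the linear-regression statistic n·r², which gives the same order as ranking by P value from lowest to highest; λGC is taken as the median test statistic divided by the median of a 1-d.f. chi-square distribution; `lmm_scan` is a hypothetical user-supplied function that runs an LMM scan with the given SNPs in the similarity matrix and returns 1-d.f. chi-square statistics for all tested SNPs; `grid`, `chrom` and `pos_cm` are likewise illustrative names. The exclusion shown is the naive one, and the speedup of Supplementary Note 2 is not reproduced.

```python
from statistics import NormalDist

import numpy as np

# Median of a 1-d.f. chi-square distribution (about 0.455).
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def linreg_chi2(snps, y):
    """Per-SNP statistic n * r^2 from simple linear regression.

    Assumes every SNP column is polymorphic (non-zero variance).
    """
    n = len(y)
    xc = snps - snps.mean(axis=0)
    yc = y - y.mean()
    r = (xc.T @ yc) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return n * r ** 2


def lambda_gc(chi2_stats):
    """Genomic control factor: median statistic over its expected median."""
    return float(np.median(chi2_stats) / CHI2_1_MEDIAN)


def select_snps(snps, y, lmm_scan, grid):
    """Grow the SNP set in ranked order and stop at the first minimum of lambda_GC.

    lmm_scan (hypothetical) maps an array of selected SNP indices to chi-square
    statistics; grid is an increasing list of candidate set sizes.
    """
    order = np.argsort(-linreg_chi2(snps, y))        # strongest association first
    best_k, best_lam = None, None
    for k in grid:
        lam = lambda_gc(lmm_scan(order[:k]))
        if best_lam is not None and lam > best_lam:
            break                                    # lambda_GC rose: previous size was the first minimum
        best_k, best_lam = k, lam
    return order[:best_k], best_lam


def drop_proximal(selected, chrom, pos_cm, test_idx, window_cm=2.0):
    """Remove the test SNP and same-chromosome SNPs within window_cm of it."""
    selected = np.asarray(selected)
    near = (chrom[selected] == chrom[test_idx]) & (
        np.abs(pos_cm[selected] - pos_cm[test_idx]) <= window_cm
    )
    return selected[~near]
```

In use, `drop_proximal` would be applied to the selected set once per tested SNP before the similarity matrix is formed, which is what makes the naive version costly.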
Many proteins were either overrepresented or underrepresented in each of the protease data sets, and clustering showed that enzyme specificity had the most influence on the results. Some examples within the top 1,000 proteins showed that, for specific proteins, one protease outperformed all the others (Fig. 1c and Supplementary Fig. 3). Our data demonstrated that quantitation based on both spectral counting and peptide intensity was indeed biased when relying solely on a single protease, and this bias affected even the most abundant proteins, sometimes by more than a factor of 1,000. Amino acid analysis revealed that proteins overrepresented in a data set obtained by a particular protease contained relatively more cleavage-specific residues for that protease (Supplementary Fig. 3). Our data stress that the best proteotypic peptides are not necessarily tryptic, a finding that may affect other quantitative assays, such as selected reaction monitoring, as well. Raw and processed mass spectrometry identification data are available through thegpm.org at ftp://ftp.proteomecentral.org/ public/0/ice.0.e.

[1] C. Hoggart, et al. Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies, 2008, PLoS Genetics.

[2] David Reich, et al. Discerning the Ancestry of European Americans in Genetic Association Studies, 2007, PLoS Genetics.

[3] Detlef Weigel, et al. Recombination and linkage disequilibrium in Arabidopsis thaliana, 2007, Nature Genetics.

[4] Simon C. Potter, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, 2007, Nature.

[5] H. Kang, et al. Variance component model to account for sample structure in genome-wide association studies, 2010, Nature Genetics.

[6] D. Heckerman, et al. Efficient Control of Population Structure in Model Organism Association Mapping, 2008, Genetics.

[7] Bjarni J. Vilhjálmsson, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines, 2010.

[8] Gabriel Silva, et al. An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels, 2009, BMC Genetics.

[9] Simon Cawley, et al. Description of the data from the Collaborative Study on the Genetics of Alcoholism (COGA) and single-nucleotide polymorphism genotyping for Genetic Analysis Workshop 14, 2005, BMC Genetics.

[10] Detlef Weigel, et al. The Scale of Population Structure in Arabidopsis thaliana, 2010, PLoS Genetics.

[11] Victor H Hernandez, et al. Nature Methods, 2007.

[12] P. Rantakallio, et al. Groups at risk in low birth weight infants and perinatal mortality, 1969, Acta Paediatrica Scandinavica.

[13] Ying Liu, et al. FaST linear mixed models for genome-wide association studies, 2011, Nature Methods.

[14] Francisco M De La Vega, et al. Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples, 2011, Investigative Genetics.

[15] D. Balding. A tutorial on statistical methods for population association studies, 2006, Nature Reviews Genetics.

[16] David J Balding, et al. Logistic regression protects against population structure in genetic association studies, 2005, Genome Research.

[17] K. Mossman. The Wellcome Trust Case Control Consortium, U.K., 2008.

[18] K. Roeder, et al. Genomic Control for Association Studies, 1999, Biometrics.

[19] William J. Astle, et al. Population Structure and Cryptic Relatedness in Genetic Association Studies, 2009, 1010.4681.

[20] C. Hoggart, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population, 2008, Nature Genetics.

[21] P. Visscher, et al. Increased accuracy of artificial selection by using the realized relationship matrix, 2009, Genetics Research.

[22] D. Mash, et al. GABAergic Gene Expression in Postmortem Hippocampus from Alcoholics and Cocaine Addicts; Corresponding Findings in Alcohol-Naïve P and NP Rats, 2012, PLoS ONE.