How to link call rate and p‐values for Hardy–Weinberg equilibrium as measures of genome‐wide SNP data quality

We study the link between two quality measures of SNP (single nucleotide polymorphism) data in genome‐wide association (GWA) studies, namely per‐SNP call rates (CR) and p‐values for testing Hardy–Weinberg equilibrium (HWE). The aim is to improve these measures by applying methods based on realized randomized p‐values, the false discovery rate and estimates of the proportion of false hypotheses. Exact non‐randomized conditional p‐values for testing HWE cannot be recommended for estimating the proportion of false hypotheses; their realized randomized counterparts should be used instead. P‐values from the asymptotic unconditional chi‐square test lead to reasonable estimates only if SNPs with low minor allele frequency are excluded. We provide an algorithm to compute the probability that a SNP violates HWE given its observed CR, which yields an improved measure of data quality. The proposed methods are applied to SNP data from the KORA (Cooperative Health Research in the Region of Augsburg, Southern Germany) 500 K project, a GWA study in a population‐based sample genotyped with Affymetrix GeneChip 500 K arrays using the calling algorithm BRLMM 1.4.0. We show that SNPs with CR = 100 per cent are almost perfectly in HWE, which suggests that the population meets the conditions required for HWE, at least for these SNPs. Moreover, we show that the proportion of SNPs not in HWE increases with decreasing CR. We conclude that applying a single threshold to HWE p‐values without taking the CR into account is problematic. Instead, we recommend an analysis stratified by CR. Copyright © 2010 John Wiley & Sons, Ltd.
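The difference between non‐randomized and realized randomized exact p‐values can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes the standard conditional distribution of the heterozygote count given the allele counts, a common probability‐ordered construction of the randomized p‐value, and a simple Storey‐type estimator with a fixed tuning parameter `lam`. The function names `het_probs`, `hwe_pvalues` and `prop_false` are hypothetical.

```python
import math
import random

def het_probs(n_aa, n_ab, n_bb):
    """Conditional distribution of the heterozygote count under HWE,
    given the sample size and the allele counts (biallelic SNP)."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab          # count of allele A
    n_b = 2 * n - n_a              # count of allele B
    probs = {}
    # The heterozygote count has the same parity as the allele counts.
    for h in range(n_a % 2, min(n_a, n_b) + 1, 2):
        aa, bb = (n_a - h) // 2, (n_b - h) // 2
        logp = (math.lgamma(n + 1) - math.lgamma(aa + 1)
                - math.lgamma(h + 1) - math.lgamma(bb + 1)
                + h * math.log(2)
                + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
                - math.lgamma(2 * n + 1))
        probs[h] = math.exp(logp)
    return probs

def hwe_pvalues(n_aa, n_ab, n_bb, u=None, tol=1e-7):
    """Return (non-randomized exact p, realized randomized p).
    Assumed construction: outcomes are ordered by their probability;
    the randomized p-value weights the tied mass by u ~ Uniform(0, 1)."""
    probs = het_probs(n_aa, n_ab, n_bb)
    p_obs = probs[n_ab]
    less = sum(p for p in probs.values() if p < p_obs * (1 - tol))
    equal = sum(p for p in probs.values() if abs(p - p_obs) <= p_obs * tol)
    if u is None:
        u = random.random()
    return min(1.0, less + equal), min(1.0, less + u * equal)

def prop_false(pvals, lam=0.5):
    """Storey-type estimate of the proportion of false null hypotheses
    (an illustrative choice of estimator, with lam fixed by assumption)."""
    pi0 = sum(p > lam for p in pvals) / (len(pvals) * (1 - lam))
    return max(0.0, 1.0 - min(1.0, pi0))

if __name__ == "__main__":
    exact_p, randomized_p = hwe_pvalues(10, 20, 30)
    print(exact_p, randomized_p)
```

Under these assumptions, `prop_false` could be applied separately to the randomized p‐values of SNPs within each CR stratum, in the spirit of the stratified analysis recommended above. Because the exact conditional distribution is discrete, the non‐randomized p‐values are not uniform under the null, which is one way to see why a uniformity‐based estimator may behave poorly with them.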
