Cost-Effective Prediction of Gender-Labeling Errors and Estimation of Gender-Labeling Error Rates in Candidate-Gene Association Studies

We describe a statistical approach for predicting gender-labeling errors in candidate-gene association studies when Y-chromosome markers have not been included in the genotyping set. The approach improves on methods that consider only the heterozygosity of X-chromosome SNPs by also using the intensity of X-chromosome SNPs in candidate genes relative to that of autosomal SNPs from the same individual. To our knowledge, no published method formalizes a framework in which heterozygosity and relative intensity are taken into account simultaneously. An advantage of our method is that it requires no space in the genotyping set beyond that already assigned to X-chromosome SNPs in the candidate genes. We also show how the predictions can be used in a two-phase sampling design to estimate the gender-labeling error rates for an entire study at a fraction of the cost of a conventional design.
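To make the two-phase idea concrete, the following is a minimal sketch, not the authors' estimator. It assumes the phase-1 predictions split subjects into two strata (flagged as likely mislabeled, or not), that a costly phase-2 check is run on a simple random subsample of each stratum, and that the study-wide error rate is estimated with a textbook stratified estimator. The function name `two_phase_error_rate`, the `verify` callback, and the sample sizes are all hypothetical.

```python
import random


def two_phase_error_rate(predicted_flags, verify, n_per_stratum, seed=0):
    """Stratified two-phase estimate of a study-wide labeling error rate.

    predicted_flags: list of bool, one per subject; True means the phase-1
        prediction flags the subject as likely mislabeled (assumed input).
    verify: callable taking a subject index and returning True if the
        expensive phase-2 check confirms a labeling error (hypothetical).
    n_per_stratum: dict {True: n_flagged, False: n_unflagged} giving the
        phase-2 sample size for each stratum.
    """
    rng = random.Random(seed)
    total = len(predicted_flags)
    estimate = 0.0
    for flag in (True, False):
        # Phase 1: stratum membership is known for every subject.
        stratum = [i for i, f in enumerate(predicted_flags) if f == flag]
        if not stratum:
            continue
        n = min(n_per_stratum[flag], len(stratum))
        if n < 1:
            raise ValueError("each non-empty stratum needs a phase-2 sample")
        # Phase 2: verify only a simple random subsample of the stratum.
        sample = rng.sample(stratum, n)
        stratum_rate = sum(bool(verify(i)) for i in sample) / n
        # Weight the stratum rate by the stratum's share of the study.
        estimate += (len(stratum) / total) * stratum_rate
    return estimate


# Toy usage with made-up data: 100 subjects, 5 true errors, 6 flagged.
truth = [i < 5 for i in range(100)]
flags = [i < 4 or i in (10, 11) for i in range(100)]
rate = two_phase_error_rate(flags, lambda i: truth[i], {True: 6, False: 20})
print(f"estimated error rate: {rate:.3f}")
```

Verifying most or all of the small flagged stratum and only a modest sample of the large unflagged stratum is what keeps the cost below that of checking every subject. How much is saved depends on how well the phase-1 predictions separate the strata.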
