Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies.

We present a novel method for simultaneous genotype calling and haplotype-phase inference. Our method employs the computationally efficient BEAGLE haplotype-frequency model, which can be applied to large-scale studies with millions of markers and thousands of samples. We compare genotype calls made with our method to genotype calls made with the BIRDSEED, CHIAMO, GenCall, and ILLUMINUS genotype-calling methods, using genotype data from the Illumina 550K and Affymetrix 500K arrays. We show that our method has higher genotype-call accuracy and yields fewer uncalled genotypes than competing methods. We perform single-marker analysis of data from the Wellcome Trust Case Control Consortium bipolar disorder and type 2 diabetes studies. For bipolar disorder, the genotype calls in the original study yield 25 markers with apparent false-positive association with bipolar disorder at a p < 10⁻⁷ significance level, whereas genotype calls made with our method yield no associated markers at this significance threshold. Conversely, for markers with replicated association with type 2 diabetes, there is good concordance between genotype calls used in the original study and calls made by our method. Results from single-marker and haplotypic analysis of our method's genotype calls for the bipolar disorder study indicate that our method is highly effective at eliminating genotyping artifacts that cause false-positive associations in genome-wide association studies. Our new genotype-calling methods are implemented in the BEAGLE and BEAGLECALL software packages.
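
To make the single-marker analysis concrete, the following is a minimal sketch of one common single-marker test, an allelic chi-square test with one degree of freedom, compared against the 10⁻⁷ threshold quoted above. It is an illustration only: the abstract does not state which test statistic was used, the function name `allelic_chi_square` is hypothetical, and the genotype counts are invented for the example rather than taken from any study.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Allelic 2x2 chi-square test (1 d.f.) for a single biallelic marker.

    Each argument is a tuple (n_AA, n_AB, n_BB) of genotype counts.
    Returns (chi2, p_value). Hypothetical helper, not part of BEAGLE.
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        # Each individual carries two alleles; heterozygotes add one of each.
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab

    a, b = allele_counts(case_counts)      # case A and B allele counts
    c, d = allele_counts(control_counts)   # control A and B allele counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Monomorphic marker or an empty group: no evidence of association.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented counts: a large allele-frequency difference between the groups,
# as a systematic genotype-calling artifact in one group could produce.
chi2, p = allelic_chi_square((900, 90, 10), (500, 400, 100))
print(p < 1e-7)
```

A marker whose calls differ systematically between cases and controls can cross such a threshold without any true association, which is the kind of artifact the abstract reports being removed by the new calls.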
