Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms.

The accuracy of the vast amount of genotypic information generated by high-throughput genotyping technologies is crucial in haplotype analyses and linkage-disequilibrium mapping for complex diseases. To date, most automated genotype-calling programs provide no quality measure for their allele calls, so manual review, which is both labor intensive and error prone, is still required. Here, we propose a novel genotype clustering algorithm, GeneScore, based on a bivariate t-mixture model, which assigns each data point a set of probabilities of belonging to the candidate genotype clusters. Furthermore, we describe an expectation-maximization (EM) algorithm for haplotype phasing, GenoSpectrum (GS)-EM, which accepts probabilistic multilocus genotype matrices (called "GenoSpectrum") as input. Combining these two model-based algorithms, we can perform haplotype inference directly on raw readouts from a genotyping platform, such as the TaqMan assay. Using both simulated and real data sets, we demonstrate the advantages of our probabilistic approach over current genotype scoring methods, in terms of both the accuracy of haplotype inference and the statistical power of haplotype-based association analyses.
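To make the idea of phasing from probabilistic genotypes concrete, the following is a minimal illustrative sketch, not the published GS-EM implementation. It assumes biallelic SNPs, Hardy-Weinberg equilibrium, and genotype uncertainty that is independent across loci, and it enumerates all 2^L haplotypes, so it is only practical for a handful of loci. The function names `em_haplotype_freqs` and `hard_call` and the input layout (one probability for each of 0, 1, or 2 copies of allele 1 per individual and locus) are assumptions made for this example.

```python
import itertools

import numpy as np


def hard_call(genotypes):
    """Turn hard genotype calls (0, 1, or 2 copies of allele 1) into
    one-hot probability rows, shape (n_individuals, n_loci, 3)."""
    return np.eye(3)[np.asarray(genotypes)]


def em_haplotype_freqs(geno_probs, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM from probabilistic genotypes.

    geno_probs[i, l, g] is the probability that individual i carries
    g copies of allele 1 at locus l (hypothetical input layout).
    Returns (haplotypes, frequencies).
    """
    geno_probs = np.asarray(geno_probs, dtype=float)
    n_ind, n_loci, _ = geno_probs.shape

    # All 2^L candidate haplotypes, one row per haplotype.
    haps = np.array(list(itertools.product([0, 1], repeat=n_loci)))
    n_hap = len(haps)

    # Genotype implied at each locus by every ordered haplotype pair.
    pair_geno = haps[:, None, :] + haps[None, :, :]        # (H, H, L)

    # weight[i, a, b]: probability of individual i's readout-derived
    # genotype probabilities under the pair (a, b), assuming loci are
    # independent given the pair.
    loci = np.arange(n_loci)
    weight = geno_probs[:, loci, pair_geno].prod(axis=-1)  # (n, H, H)

    freqs = np.full(n_hap, 1.0 / n_hap)
    for _ in range(n_iter):
        # E-step: posterior over ordered haplotype pairs per individual.
        post = weight * np.outer(freqs, freqs)
        post /= post.sum(axis=(1, 2), keepdims=True)

        # M-step: expected haplotype counts over both chromosomes.
        counts = post.sum(axis=(0, 2)) + post.sum(axis=(0, 1))
        new_freqs = counts / (2.0 * n_ind)

        converged = np.abs(new_freqs - freqs).max() < tol
        freqs = new_freqs
        if converged:
            break
    return haps, freqs
```

With hard calls for three individuals homozygous for haplotype 00 and one homozygous for haplotype 11, `em_haplotype_freqs(hard_call([[0, 0], [0, 0], [0, 0], [2, 2]]))` returns frequencies 0.75 and 0.25 for haplotypes 00 and 11. Replacing the one-hot rows with softer probabilities lets ambiguous calls contribute in proportion to their support instead of being forced to a single genotype or discarded.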
