Understanding the accuracy of statistical haplotype inference with sequence data of known phase

Statistical methods for haplotype inference from multi‐site genotypes of unrelated individuals have important applications in association studies and population genetics. Understanding the factors that affect the accuracy of this inference is important, but their assessment has been restricted by the limited availability of biological data with known phase. We created hybrid cell lines monosomic for human chromosome 19 and produced complete single‐chromosome sequences of a 48 kb genomic region in 39 individuals of African American (AA) and European American (EA) origin. We employed these phase‐known genotypes and coalescent simulations to assess the accuracy of statistical haplotype reconstruction by several algorithms. Accuracy of phase inference was low in our biological data even for regions as short as 25–50 kb, suggesting that caution is needed when analyzing reconstructed haplotypes. Moreover, the confidence estimates that accompany phase inference are not reliable enough to justify incorporating site‐specific uncertainty information into subsequent analyses. We show that, in samples of mixed ancestry such as ours (AA and EA populations), the most accurate haplotypes are probably obtained by analyzing the largest, pooled sample, despite the hypothetical problems associated with pooling across heterogeneous samples. We discuss strategies to improve confidence in reconstructed haplotypes, as well as realistic alternatives to the analysis of inferred haplotypes. Genet. Epidemiol. © 2007 Wiley‐Liss, Inc.
