HaploRec: efficient and accurate large-scale reconstruction of haplotypes

Background

Haplotypes extracted from human DNA can be used for gene mapping and other analysis of genetic patterns within and across populations. A fundamental problem is, however, that current practical laboratory methods do not give haplotype information. Estimation of phased haplotypes of unrelated individuals given their unphased genotypes is known as the haplotype reconstruction or phasing problem.

Results

We define three novel statistical models and give an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on exploiting local regularities conserved in haplotypes: it reconstructs haplotypes so that they have maximal local coherence. This approach, which does not assume statistical dependence for remotely located markers, has two useful properties: it is well-suited for sparse marker maps, such as those used in gene mapping, and it can actually take advantage of long maps.

Conclusion

Our experimental results with simulated and real data show that HaploRec is a powerful method for the large-scale haplotyping needed in association studies. With sample sizes large enough for gene mapping, it appeared to be the best of all tested methods (Phase, fastPhase, PL-EM, Snphap, Gerbil; simulated data); with small samples, it was competitive with the best available methods (real data). HaploRec is several orders of magnitude faster than Phase and comparable to the other methods; the running times are roughly linear in the number of subjects and the number of markers. HaploRec is publicly available at http://www.cs.helsinki.fi/group/genetics/haplotyping.html.
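
To make the phasing problem stated in the abstract concrete, the following minimal Python sketch enumerates every haplotype pair that is consistent with one unphased genotype. It is an illustration of the problem only, not HaploRec's algorithm or models; the function name `compatible_haplotype_pairs` and the 0/1 allele encoding are assumptions introduced here.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of per-marker allele pairs, e.g. (0, 1) for a
    heterozygous marker and (1, 1) for a homozygous one. The 0/1 allele
    encoding is an assumption made for this illustration.
    """
    # Only heterozygous markers are ambiguous with respect to phase.
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Try every way of assigning the two alleles at each heterozygous marker.
    for flips in product((False, True), repeat=len(het_sites)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        for site, flip in zip(het_sites, flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        # Sort the two haplotypes so that (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    # Two heterozygous markers give two distinct phasings.
    for h1, h2 in compatible_haplotype_pairs([(0, 1), (1, 1), (0, 1)]):
        print(h1, h2)
```

A genotype with k heterozygous markers has 2^(k-1) distinct phasings for k ≥ 1, so exhaustive enumeration becomes infeasible for long marker maps; this is why statistical reconstruction methods are needed.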
