Large scale reconstruction of haplotypes from genotype data

Modeling human genetic variation is critical to understanding the genetic basis of complex diseases. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs), which are mutations at a single nucleotide position. To characterize an individual's variation, we must determine the individual's haplotypes, that is, which nucleotide base occurs at each of these common SNP positions on each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes: SNPs are organized in highly correlated "blocks", and the majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and, for each block, predicts the common haplotypes and each individual's haplotype. We evaluate our method on biological data. It predicts the common haplotypes perfectly and has a very low error rate (0.47%) when the predictions for the uncommon haplotypes are taken into account. The method is also far more efficient than previous methods, running in a matter of seconds where previous methods needed hours. This efficiency allows us to find the block partition of the haplotypes, to cope with missing data, and to work with large data sets such as genotypes for thousands of SNPs for hundreds of individuals. The algorithm is available via a web server at http://www.cs.columbia.edu/compbio/hap.
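To make the relationship between genotypes and haplotypes concrete, the following minimal Python sketch resolves an unphased genotype within one block against a set of candidate common haplotypes. It is an illustration only and not the method described in this paper: the 0/1/2 encoding (0 and 1 for homozygous sites, 2 for heterozygous sites), the function names `genotype_of` and `compatible_pairs`, and the example haplotypes are all assumptions introduced here.

```python
from itertools import combinations_with_replacement


def genotype_of(h1, h2):
    """Combine two haplotypes (strings over '0'/'1') into an unphased genotype.

    Sites where both haplotypes agree keep that allele; sites where they
    differ (heterozygous sites) are written as '2'.
    """
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def compatible_pairs(genotype, candidate_haplotypes):
    """Return every unordered pair of candidates that explains the genotype."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(candidate_haplotypes, 2)
        if genotype_of(h1, h2) == genotype
    ]


# Hypothetical block of 4 SNPs with four common haplotypes (made-up data).
common = ["0000", "0110", "1011", "1101"]

# Exactly one pair of common haplotypes explains this genotype.
print(compatible_pairs("0220", common))

# With all four 2-SNP haplotypes as candidates, the doubly heterozygous
# genotype "22" has two explanations, so phasing is ambiguous.
print(compatible_pairs("22", ["00", "01", "10", "11"]))
```

When only a few common haplotypes exist in a block, most genotypes are explained by a single pair, which is why a block structure can make resolution tractable. The second call shows the ambiguity that arises when the candidate set is unrestricted.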
