Haplotype reconstruction from SNP alignment

In this paper we describe a method for statistical reconstruction of haplotypes from a set of aligned SNP fragments. We consider the case of a pair of homologous human chromosomes, one from the mother and the other from the father. After fragment assembly and SNP detection, we wish to reconstruct the two haplotypes of the parents. Given a set of SNP sites inferred from the assembly alignment, we wish to divide the fragment set into two subsets, each of which represents one chromosome. Our method is based on a statistical model of sequencing errors, compositional information and haplotype memberships.We calculate probabilities of different haplotypes conditional on the alignment. Due to computational complexity, we first determine phases for neighboring SNPs. Then we connect them and construct haplotype segments. Also we compute the accuracy or confidence of the reconstructed haplotypes. We discuss other issues such as alternative methods, parameter estimation, computational efficiency, and relaxation of assumptions.

[1]  M. Waterman,et al.  The accuracy of DNA sequences: estimating sequence quality. , 1992, Genomics.

[2]  W. Wong,et al.  The calculation of posterior distributions by data augmentation , 1987 .

[3]  L. Kedes,et al.  Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Nomenclature Committee of the International Union of Biochemistry (NC-IUB). , 1986, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Christopher J. Lee,et al.  Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences , 2000, Nature Genetics.

[5]  D. Nickerson,et al.  PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. , 1997, Nucleic acids research.

[6]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[7]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[8]  Russell Schwartz,et al.  SNPs Problems, Complexity, and Algorithms , 2001, ESA.

[9]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.

[10]  A. Cornish-Bowden Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. , 1985, Nucleic acids research.

[11]  Michael S. Waterman,et al.  Introduction to computational biology , 1995 .

[12]  E. Lander,et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis. , 1988, Genomics.