Accuracy Assessment of Diploid Consensus Sequences

If the origins of fragments are known in genome sequencing projects, it is straightforward to reconstruct diploid consensus sequences. In reality, however, this is not true. Although there are proposed methods to reconstruct haplotypes from genome sequencing projects, an accuracy assessment is required to evaluate the confidence of the estimated diploid consensus sequences. In this paper, we define the confidence score of diploid consensus sequences. It requires the calculation of the likelihood of an assembly. To calculate the likelihood, we propose a linear time algorithm with respect to the number of polymorphic sites. The likelihood calculation and confidence score are used for further improvements of haplotype estimation in two directions. One direction is that low-scored phases are disconnected. The other direction is that, instead of using nominal frequency 1/2, the haplotype frequency is estimated to reflect the actual contribution of each haplotype. Our method was evaluated on the simulated data whose polymorphism rate (1.2 percent) was based on Ciona intestinalis. As a result, the high accuracy of our algorithm was indicated: The true positive rate of the haplotype estimation was greater than 97 percent.

[1]  E. Lander,et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis. , 1988, Genomics.

[2]  M. Waterman,et al.  The accuracy of DNA sequences: estimating sequence quality. , 1992, Genomics.

[3]  Jong Hyun Kim,et al.  Haplotype Reconstruction from SNP Alignment , 2004, J. Comput. Biol..

[4]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[5]  Stephen M. Mount,et al.  The genome sequence of Drosophila melanogaster. , 2000, Science.

[6]  Paramvir S. Dehal,et al.  Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes , 2002, Science.

[7]  Sheldon M. Ross,et al.  Introduction to Probability Models, Eighth Edition , 1972 .

[8]  Lei M. Li,et al.  Adjust quality scores from alignment and improve sequencing accuracy. , 2004, Nucleic acids research.

[9]  Paul Richardson,et al.  The Draft Genome of Ciona intestinalis: Insights into Chordate and Vertebrate Origins , 2002, Science.

[10]  G. Lancia,et al.  Algorithmic Strategies for the SNP Haplotype Assembly Problem , 2002 .

[11]  Timothy B. Stockwell,et al.  The Sequence of the Human Genome , 2001, Science.

[12]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.

[13]  M. Adams,et al.  Recent Segmental Duplications in the Human Genome , 2002, Science.

[14]  Russell Schwartz,et al.  SNPs Problems, Complexity, and Algorithms , 2001, ESA.

[15]  Russell Schwartz,et al.  Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem , 2002, Briefings Bioinform..