Finding optimal threshold for correction error reads in DNA assembling

Background: DNA assembling is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. In experiments, the reads may contain errors, which degrade the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct erroneous reads by counting how many times each length-k substring of the reads appears in the input. They treat length-k substrings that appear at least M times as correct and correct the erroneous reads based on these substrings. However, because the threshold M is chosen without any solid theoretical analysis, these algorithms cannot guarantee their error-correction performance.

Results: In this paper, we propose a method to calculate the probabilities of false positives and false negatives when a threshold M is used to decide whether a length-k substring is correct. Based on this calculation, we find the optimal threshold M that minimizes the total error (false positives plus false negatives). Experimental results on both real and simulated data showed that our calculation is correct and that we can reduce the total number of erroneous substrings by 77.6% and 65.1% compared with ECINDEL and SRCorr, respectively.

Conclusion: We introduced a method to calculate the probabilities of false positives and false negatives for length-k substrings under different thresholds. Based on this calculation, we found the optimal threshold that minimizes the total error, i.e. false positives plus false negatives.
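To make the role of the threshold M concrete, the following is a minimal sketch of threshold-based classification of length-k substrings. It is not the probabilistic calculation proposed in the paper, nor the ECINDEL or SRCorr implementation: it assumes a toy setting in which the true genome is known, so false positives and false negatives can simply be counted for each candidate M. All function names, the toy genome, and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each length-k substring occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def error_counts(counts, true_kmers, m):
    """Return (false positives, false negatives) for threshold m.

    A substring is accepted as correct if it occurs at least m times.
    False positive: accepted, but not a substring of the true genome.
    False negative: a substring of the true genome that is not accepted.
    """
    accepted = {kmer for kmer, c in counts.items() if c >= m}
    return len(accepted - true_kmers), len(true_kmers - accepted)


def best_threshold(counts, true_kmers, max_m):
    """Empirically pick the m in 1..max_m with the fewest total errors."""
    return min(range(1, max_m + 1),
               key=lambda m: sum(error_counts(counts, true_kmers, m)))


# Toy data (assumed for illustration): the last read has one substitution error.
k = 4
genome = "ACGTTGCAAG"
reads = ["ACGTTGC", "GTTGCAAG", "ACGTTGCAAG", "ACGATGC"]

true_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
counts = count_kmers(reads, k)

for m in (1, 2, 3):
    print(m, error_counts(counts, true_kmers, m))
print("best M:", best_threshold(counts, true_kmers, 3))  # best M: 2
```

In this toy case M = 1 accepts the four substrings created by the sequencing error, while M = 3 rejects five genuine substrings that are covered only twice, so M = 2 gives the fewest total errors. In practice the true genome is unknown, which is why the threshold has to be derived analytically rather than by counting as done here.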
