naiveBayesCall: An Efficient Model-Based Base-Calling Algorithm for High-Throughput Sequencing

Immense amounts of raw instrument data (i.e., images of fluorescence) are currently being generated using ultra high-throughput sequencing platforms. An important computational challenge associated with this rapid advancement is to develop efficient algorithms that can extract accurate sequence information from raw data. To address this challenge, we recently introduced a novel model-based base-calling algorithm that is fully parametric and has several advantages over previously proposed methods. Our original algorithm, called BayesCall, significantly reduced the error rate, particularly in the later cycles of a sequencing run, and also produced useful base-specific quality scores with a high discrimination ability. However, BayesCall is too computationally expensive to be of broad practical use. In this paper, we build on our previous model-based approach to devise an efficient base-calling algorithm that is orders of magnitude faster than BayesCall while maintaining a comparably high level of accuracy. Our new algorithm, called naiveBayesCall, uses approximation and optimization methods to achieve scalability. We describe the performance of naiveBayesCall and demonstrate how improved base-calling accuracy may facilitate de novo assembly when the coverage is low to moderate.
