Statistical learning formulation of the DNA base-calling problem and its solution in a Bayesian EM framework

Abstract: A novel formulation of the DNA sequence base-calling problem is introduced, together with algorithms for its solution. The proposed approach brings DNA base-calling within the framework of a statistical learning paradigm, which allows prior knowledge about the structure of the problem to be incorporated directly into the base-calling algorithms without resorting to heuristics. This prior knowledge provides constraints that help disambiguate the different possible interpretations of the data in regions of low signal-to-noise ratio (SNR), and it is shown to substantially increase the number of DNA bases that can be accurately called in such regions. Our experimental results suggest that the proposed algorithms, without being optimized, can achieve base-calling performance that matches, and often exceeds, that of commercially available software. Furthermore, because of their statistical basis, the algorithms also provide confidence estimates, in the form of posterior probabilities, for the base calls they produce; these estimates can be used for sequence assembly and mutation detection.
