Signal Background Estimation and Baseline Correction Algorithms for Accurate DNA Sequencing

Accurate identification of a DNA sequence depends on precisely tracking the time-varying signal baseline in all parts of the electrophoretic trace. We propose a statistical learning formulation of the signal background estimation problem that can be solved with an Expectation-Maximization (EM) type algorithm. We also present an alternative method, based on a recursive histogram computation, for estimating the background level of a signal in small windows. Both background estimation algorithms can be combined with regression methods to track the slow and fast baseline changes that occur in different regions of a DNA chromatogram. Accurate baseline tracking improves cluster separation and thus helps reduce classification errors when the Bayesian EM (BEM) base-calling system developed in our group (Pereira et al., Discrete Applied Mathematics, 2000) is used to decide how many bases are “hidden” in each base-call event pattern extracted from the chromatogram.
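As a rough illustration of window-based background estimation combined with a simple fit, the sketch below takes the histogram mode of each window as its background level and joins the window estimates by linear interpolation. It is not the algorithm of this paper: the histogram is computed once per window rather than recursively, `np.interp` stands in for the regression step, and the function names, `window_size`, and `bins` are hypothetical choices made for the example.

```python
import numpy as np


def window_background(window, bins=32):
    """Estimate the background level of one window as its histogram mode."""
    counts, edges = np.histogram(window, bins=bins)
    k = int(np.argmax(counts))               # most populated bin
    return 0.5 * (edges[k] + edges[k + 1])   # centre of that bin


def estimate_baseline(trace, window_size=200, bins=32):
    """Track a slowly varying baseline from per-window histogram modes."""
    trace = np.asarray(trace, dtype=float)
    centres, levels = [], []
    for start in range(0, len(trace), window_size):
        window = trace[start:start + window_size]
        centres.append(start + 0.5 * (len(window) - 1))
        levels.append(window_background(window, bins))
    # Piecewise-linear interpolation through the window estimates;
    # np.interp holds the end values constant outside the outermost centres.
    return np.interp(np.arange(len(trace)), centres, levels)
```

Subtracting the returned array from the trace gives a baseline-corrected signal. The mode is a reasonable background proxy only when most samples in a window sit near the baseline, so the window has to be wide enough to contain peak-free stretches and narrow enough that the baseline changes little inside it.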

[1] James B. Golden, et al. Pattern Recognition for Automated DNA Sequencing: I. On-Line Signal Conditioning and Feature Extraction for Basecalling, 1993, ISMB.

[2] Elias S. Manolakos, et al. Accurate estimation of the signal baseline in DNA chromatograms, 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[3] Elias S. Manolakos, et al. Statistical learning formulation of the DNA base-calling problem and its solution in a Bayesian EM framework, 2000, Discrete Applied Mathematics.

[4] A. Berno. A graph theoretic approach to the analysis of DNA sequencing data, 1996, Genome Research.

[5] James C. Mullikin, et al. A probabilistic approach for long read-length DNA sequence analysis, 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[6] Terence A. Brown. DNA Sequencing: The Basics, 1994.

[7] Barry L. Karger, et al. A maximum-likelihood base caller for DNA sequencing, 2000, IEEE Transactions on Biomedical Engineering.

[8] J. M. Lacroix, et al. High performance DNA sequencing, and the detection of mutations and polymorphisms, on the Clipper sequencer, 1999, Electrophoresis.

[9] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins, 1970, Journal of Molecular Biology.

[10] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of Molecular Biology.

[11] M. Westphall, et al. Automatic matrix determination in four dye fluorescence-based DNA sequencing, 1996, Electrophoresis.

[12] H. Fujii, et al. Compensation for mobility inequalities between lanes computed from band signals in on-line fluorescence DNA sequencing, 1992, Electrophoresis.

[13] L. Alphey. DNA Sequencing: From Experimental Methods to Bioinformatics, 1997.

[14] P. Green, et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment, 1998, Genome Research.

[15] M. Westphall, et al. A software system for data analysis in automated DNA sequencing, 1998, Genome Research.

[16] T. Moon. The expectation-maximization algorithm, 1996, IEEE Signal Processing Magazine.

[17] M. Morris, et al. Basecalling with LifeTrace, 2001, Genome Research.