Robust normalization of DNA chromatograms by regression for improved base-calling

Abstract: The accuracy of base-calling software depends critically on the data normalization method it uses. We present a new regression-based method for normalizing electrophoretic DNA traces. It is based on a channel-coupled exponential decay model that accurately tracks the trend of dropping peak heights (the so-called signal “skyline”), even in the low-SNR region of the trace. We provide the justification and formulation of the regression model, as well as the analytical estimation of its parameters. The proposed skyline-based normalization scheme is not affected by the presence of artifacts, compressions, false peaks, and the like, and it does not depend on the dye chemistry used. We demonstrate that it can improve the interpretation of chromatograms, and therefore increase base-calling accuracy, when employed in the pre-processing stage of either a pattern-classification base-caller (such as the BEM presented in Pereira et al. (Discrete Appl. Math. 104(1–3) (2000) 229)) or a kernel-fitting base-calling method.
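As a rough illustration of the idea of fitting a decaying skyline by regression, the sketch below shows one possible reading of a "channel-coupled" exponential decay. It assumes each channel c has its own amplitude A_c while all channels share a single decay rate b, so that peak heights follow h_c(t) = A_c exp(-b t). That model form, the function names `fit_coupled_decay` and `normalize`, and the use of ordinary least squares on log heights are all assumptions made for illustration. The abstract does not state the actual formulation or the paper's analytical estimator, and plain least squares is not a robust estimator.

```python
import numpy as np

def fit_coupled_decay(times, heights, channels, n_channels=4):
    """Fit log h = log A_c - b * t with one shared b (assumed model).

    times, heights: peak positions and positive peak heights.
    channels: integer channel index (0..n_channels-1) for each peak.
    Returns (amplitudes per channel, shared decay rate).
    """
    times = np.asarray(times, dtype=float)
    heights = np.asarray(heights, dtype=float)
    channels = np.asarray(channels, dtype=int)

    # Design matrix: one indicator column per channel (gives log A_c),
    # plus a final column of -t whose coefficient is the shared rate b.
    X = np.zeros((times.size, n_channels + 1))
    X[np.arange(times.size), channels] = 1.0
    X[:, -1] = -times

    # Ordinary least squares in the log domain (illustrative choice only).
    coef, *_ = np.linalg.lstsq(X, np.log(heights), rcond=None)
    return np.exp(coef[:n_channels]), coef[-1]

def normalize(trace, amplitudes, decay):
    """Divide a (n_channels, n_samples) trace by the fitted skyline."""
    t = np.arange(trace.shape[1])
    skyline = amplitudes[:, None] * np.exp(-decay * t)[None, :]
    return trace / skyline
```

Under these assumptions, dividing each channel by its fitted skyline brings peak heights to a comparable scale along the whole trace before base-calling.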

[1] J. M. Lacroix, et al. High performance DNA sequencing, and the detection of mutations and polymorphisms, on the Clipper sequencer, 1999, Electrophoresis.

[2] L. J. Thomas, et al. A method to determine the filter matrix in four‐dye fluorescence‐based DNA sequencing, 1997, Electrophoresis.

[3] James C. Mullikin, et al. A probabilistic approach for long read-length DNA sequence analysis, 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[4] Elias S. Manolakos, et al. Statistical learning formulation of the DNA base-calling problem and its solution in a Bayesian EM framework, 2000, Discret. Appl. Math.

[5] Elias S. Manolakos, et al. Signal Background Estimation and Baseline Correction Algorithms for Accurate DNA Sequencing, 2003, J. VLSI Signal Process.

[6] Elias S. Manolakos, et al. Automatic estimation of mobility shift coefficients in DNA chromatograms, 2003, 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (IEEE Cat. No.03TH8718).

[7] L. Alphey. DNA Sequencing: From Experimental Methods to Bioinformatics, 1997.

[8] P. Green, et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment, 1998, Genome Research.

[9] Simon J. Godsill, et al. Modelling electropherogram data for DNA sequencing using variable dimension MCMC, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[10] Terence A. Brown. DNA Sequencing: The Basics, 1994.

[11] Barry L. Karger, et al. A maximum-likelihood base caller for DNA sequencing, 2000, IEEE Transactions on Biomedical Engineering.

[12] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of Molecular Biology.

[13] M. Westphall, et al. Automatic matrix determination in four dye fluorescence‐based DNA sequencing, 1996, Electrophoresis.

[14] A. Berno. A graph theoretic approach to the analysis of DNA sequencing data, 1996, Genome Research.

[15] M. Westphall, et al. A software system for data analysis in automated DNA sequencing, 1998, Genome Research.