Bayesian Basecalling for DNA Sequence Analysis using Hidden Markov Models

It has been shown that electropherograms of DNA sequences can be modelled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach which employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that performance comparable to that of the state-of-the-art basecalling algorithm, in terms of total errors, can be achieved even when a simple Gaussian model is assumed for the emission densities.
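
As an illustration of the Viterbi decoding step mentioned above (the baseline, not the MCMC method proposed here), the following is a minimal sketch under strong simplifying assumptions: one hidden state per base, a scalar observation per time step, and a one-dimensional Gaussian emission density per state. The function names, the toy means, standard deviations, transition probabilities, and the observation values are all hypothetical and are not taken from the paper; real electropherograms are multi-channel traces and the HMM used for basecalling has a richer state structure.

```python
import math

def log_gauss(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def viterbi(obs, states, log_init, log_trans, means, stds):
    """Most likely hidden-state path for a Gaussian-emission HMM (log domain)."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: log_init[s] + log_gauss(obs[0], means[s], stds[s]) for s in states}
    back = []  # one back-pointer table per time step after the first
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_gauss(x, means[s], stds[s])
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy parameters (assumptions for illustration only)
states = ["A", "C", "G", "T"]
means = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
stds = {s: 0.3 for s in states}
log_init = {s: math.log(0.25) for s in states}
log_trans = {r: {s: math.log(0.7 if r == s else 0.1) for s in states} for r in states}

obs = [0.1, -0.2, 1.1, 0.9, 3.2, 2.9, 2.1]
path = viterbi(obs, states, log_init, log_trans, means, stds)
print("".join(path))
```

Working in the log domain avoids numerical underflow on long traces. In this sketch the parameters are fixed by hand; in the setting described above they would first be estimated in a training step, or, in the Bayesian approach, inferred jointly with the state sequence.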
