HIDDEN MARKOV MODELS FOR DNA SEQUENCING

In this paper we propose Hidden Markov Models (HMMs) as an approach to the DNA basecalling problem. We model the state emission densities using Artificial Neural Networks and provide a modified Baum-Welch re-estimation procedure for training. Moreover, we develop a method that exploits consensus sequences to label training data, thus minimizing the need for hand-labeling. Our results demonstrate the potential of these models. We also perform a careful study of the basecalling errors and propose alternative HMM topologies that might further improve performance. We conclude by suggesting further research directions.