The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes in Prokaryotic Genomes

In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes in the genomes of prokaryotes. The known coding regions of the genome are modeled as a set of sample paths of one multistep Markov chain, and the known non-coding regions as a set of sample paths of another multistep Markov chain. The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to the fixed multistep Markov model used in the various versions of GeneMark. At the same time, compared with an algorithm like Glimmer3, which interpolates between Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, the predictions can readily be ranked by the confidence one has in them. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain. The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. Its performance is compared with that of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. In a vast majority of cases, when compared with any of the other three algorithms, the 4M algorithm finds many more genes than it misses. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms, using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. The 4M algorithm is found to be highly specific: it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.
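
The two-model idea described above can be illustrated with a minimal sketch. This is not the 4M algorithm itself: it assumes a single fixed memory length (`order`) for every state, whereas 4M lets the memory length vary by state, and it does not use the rank condition. The function names, the pseudocount smoothing, and the toy sequences are all hypothetical choices made for illustration. One Markov chain is trained on coding sequences and another on non-coding sequences, and a candidate open reading frame is scored by the log-likelihood ratio between the two models.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def train_markov(sequences, order, pseudocount=1.0):
    """Estimate log P(next symbol | previous `order` symbols) from sample paths."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    # Enumerate every context so unseen contexts fall back to smoothed estimates.
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in ALPHABET
        }
    return model


def log_likelihood(seq, model, order):
    """Log-likelihood of `seq` under the model, conditioned on its first `order` symbols."""
    return sum(model[seq[i - order:i]][seq[i]] for i in range(order, len(seq)))


def log_likelihood_ratio(seq, coding_model, noncoding_model, order):
    """Positive values favor the coding model, negative values the non-coding model."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))


# Toy training data (made up for illustration, not real genomic sequence).
ORDER = 2
coding_model = train_markov(["ATGGCGGCGGCGGCGTAA"], ORDER)
noncoding_model = train_markov(["ATATATATATATTTTAAA"], ORDER)

score = log_likelihood_ratio("GCGGCGGCG", coding_model, noncoding_model, ORDER)
print("coding-like" if score > 0 else "non-coding-like")
```

Because each candidate receives a numeric score, the candidates can be sorted by it. This only gestures at the ranking of predictions mentioned in the abstract; the statistical significance measure used by 4M is not reproduced here.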
