Microbial gene identification using interpolated Markov models.

This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H. pylori and H. influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context, i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.
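The idea of a variable-length context can be illustrated with a minimal sketch. It is not GLIMMER's implementation: the function names (`train_counts`, `imm_prob`), the `threshold` parameter, and the count-based weighting rule are all assumptions made for illustration, and the abstract does not specify how the real system chooses its interpolation weights. The sketch blends estimates from contexts of length 0 up to `max_order`, trusting a longer context more the more often it was seen in training.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def train_counts(sequences, max_order):
    """Count how often each base follows every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[seq[i - k:i]][base] += 1
    return counts


def imm_prob(counts, context, base, max_order, threshold=4):
    """Interpolated probability of `base` given up to max_order preceding bases.

    Assumed weighting (hypothetical, for illustration only): a context seen
    `threshold` or more times gets full weight; rarer contexts are blended
    with the estimate from the next-shorter context.
    """
    prob = 1.0 / len(ALPHABET)  # uniform fallback below order 0
    context = context[-max_order:] if max_order > 0 else ""
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]
        seen = counts.get(ctx)
        total = sum(seen.values()) if seen else 0
        if total == 0:
            break  # longer contexts ending in ctx were never observed either
        weight = min(1.0, total / threshold)
        prob = weight * (seen.get(base, 0) / total) + (1.0 - weight) * prob
    return prob


if __name__ == "__main__":
    counts = train_counts(["ATGATGATGATG"], max_order=2)
    for b in ALPHABET:
        print(b, imm_prob(counts, "AT", b, max_order=2))
```

Because each step is a convex combination of probability distributions, the values returned for the four bases sum to one. A fixed-order model would instead use only the longest context, which becomes unreliable when that context is rare in the training data.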
