A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA

We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding codon. Two neural networks are used for splice site prediction, as in (Brunak, Engelbrecht, & Knudsen 1991). We show that this simple model performs quite well. For a cross-validated standard test set of 304 genes [ftp://www-hgc.lbl.gov/pub/genesets] in human DNA, our gene-finding system identified up to 85% of protein-coding bases correctly with a specificity of 80%. At the exon level, 58% of exons were identified exactly, with a specificity of 51%. Genie is shown to perform favorably compared with several other gene-finding systems.
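The dynamic programming step described above can be illustrated with a minimal sketch of segment-based (generalized) Viterbi decoding. This is not Genie's implementation: the two states, the per-base emission probabilities, the uniform length distribution, the `max_len` cap, and all function names are hypothetical choices made only to keep the example small. Each state emits a whole segment, and the score of a parse is the sum of log transition, log duration, and log content-sensor terms.

```python
import math

NEG_INF = float("-inf")

def ghmm_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_len):
    """Return (log-probability, parse) of the best segmentation of seq.

    parse is a list of (state, start, end) with half-open intervals.
    """
    n = len(seq)
    if n == 0:
        return 0.0, []
    # best[j][s]: best log-probability of parsing seq[:j] so that the
    # last segment is in state s and ends exactly at position j.
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_len, j) + 1):
                i = j - d
                seg = log_dur(s, d) + log_emit(s, seq[i:j])
                if i == 0:
                    # First segment of the parse: use the initial distribution.
                    cand = log_init[s] + seg
                    if cand > best[j][s]:
                        best[j][s], back[j][s] = cand, (i, None)
                else:
                    for r in states:
                        # Missing transitions are treated as forbidden.
                        cand = best[i][r] + log_trans.get((r, s), NEG_INF) + seg
                        if cand > best[j][s]:
                            best[j][s], back[j][s] = cand, (i, r)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    score = best[n][state]
    parse, j = [], n
    while state is not None:
        i, prev = back[j][state]
        parse.append((state, i, j))
        j, state = i, prev
    parse.reverse()
    return score, parse

# Toy model (all numbers are illustrative assumptions, not trained values).
STATES = ["exon", "intron"]
MAX_LEN = 10
EMIT = {
    "exon":   {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # G/C-rich content
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # A/T-rich content
}
LOG_INIT = {"exon": math.log(0.5), "intron": math.log(0.5)}
# States must alternate: only exon->intron and intron->exon are allowed.
LOG_TRANS = {("exon", "intron"): 0.0, ("intron", "exon"): 0.0}

def log_emit(state, segment):
    # Content sensor: independent per-base probabilities within a segment.
    return sum(math.log(EMIT[state][base]) for base in segment)

def log_dur(state, length):
    # Uniform length distribution over 1..MAX_LEN for every state.
    return math.log(1.0 / MAX_LEN)

seq = "AAAAAA" + "GCGCGCGC" + "AAAAAA"
score, parse = ghmm_viterbi(seq, STATES, LOG_INIT, LOG_TRANS,
                            log_dur, log_emit, MAX_LEN)
print(parse)
```

In this toy setting the explicit duration term is what distinguishes the generalized model from a standard HMM: each additional segment pays a length penalty, so the decoder prefers a few well-supported segments over many short ones. The inner loop over segment lengths makes the cost grow with `max_len`, which is why the sketch caps it.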

[1] L. Rabiner, et al. An introduction to hidden Markov models. 1986, IEEE ASSP Magazine.

[2] J. Hawkins, et al. A survey on intron and exon lengths. 1988, Nucleic acids research.

[3] C. Lawrence, et al. Algorithms for the optimal identification of segment neighborhoods. 1989.

[4] S. Knudsen, et al. Prediction of human mRNA donor and acceptor sites from the DNA sequence. 1991, Journal of molecular biology.

[5] Temple F. Smith, et al. Prediction of gene structure. 1992, Journal of molecular biology.

[6] D. Sankoff. Efficient optimal decomposition of a sequence into disjoint regions, each matched to some template in an inventory. 1992, Mathematical biosciences.

[7] J. Fickett, et al. Assessment of protein coding measures. 1992, Nucleic acids research.

[8] Mark Borodovsky, et al. GENMARK: Parallel Gene Recognition for Both DNA Strands. 1993, Comput. Chem.

[9] M. Gelfand, et al. Prediction of the exon-intron structure by a dynamic programming approach. 1993, Bio Systems.

[10] E. Snyder, et al. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. 1993, Nucleic acids research.

[11] D. Haussler, et al. A hidden Markov model that finds genes in E. coli DNA. 1994, Nucleic acids research.

[12] Yin Xu, et al. An Improved System for Exon Recognition and Gene Modeling in Human DNA Sequence. 1994, ISMB.

[13] V. Solovyev, et al. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. 1994, Nucleic acids research.

[14] D. Haussler, et al. Hidden Markov models in computational biology. Applications to protein modeling. 1993, Journal of molecular biology.

[15] David Haussler, et al. Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence. 1994, ISMB.

[16] D. Searls, et al. Gene structure prediction by linguistic methods. 1994, Genomics.

[17] M. G. Reese, et al. Novel neural network prediction systems for human promoters and splice sites. 1995.

[18] R. Guigó, et al. Evaluation of gene structure prediction programs. 1996, Genomics.

[19] S. Eddy. Hidden Markov models. 1996, Current opinion in structural biology.

[20] Yoshua Bengio, et al. Neural networks for speech and sequence recognition. 1996.