Gene prediction with a hidden Markov model and a new intron submodel

MOTIVATION The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences with an unknown number of genes in them. Here existing programs tend to predict many false exons. RESULTS We have developed a new program, AUGUSTUS, for the ab initio prediction of protein coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model, a new model for a short region directly upstream of the donor splice site model that takes the reading frame into account and apply a method that allows better GC-content dependent parameter estimation. AUGUSTUS predicts on longer sequences far more human and drosophila genes accurately than the ab initio gene prediction programs we compared it with, while at the same time being more specific. AVAILABILITY A web interface for AUGUSTUS and the executable program are located at http://augustus.gobics.de.

[1]  Burkhard Morgenstern,et al.  DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment , 1999, German Conference on Bioinformatics.

[2]  Daniel H. Huson,et al.  The Conserved Exon Method for Gene Finding , 2000, ISMB.

[3]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[4]  Anders Krogh,et al.  Two Methods for Improving Performance of a HMM and their Application for Gene Finding , 1997, ISMB.

[5]  R. Guigó,et al.  An assessment of gene prediction accuracy in large DNA sequences. , 2000, Genome research.

[6]  G. K. Wong,et al.  Most of the human genome is transcribed. , 2001, Genome research.

[7]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[8]  Ewan Birney,et al.  Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods Used in Sequence Comparison , 1997, ISMB.

[9]  V. Solovyev,et al.  Analysis of canonical and non-canonical splice sites in mammalian genomes. , 2000, Nucleic acids research.

[10]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[11]  Paul Levi,et al.  GENIO/scan - EST Guided Identification of Genes in Human Genomic DNA , 1998, German Conference on Bioinformatics.

[12]  Richard Durbin,et al.  Comparative ab initio prediction of gene structures using pair HMMs , 2002, Bioinform..

[13]  D. Haussler,et al.  Genie--gene finding in Drosophila melanogaster. , 2000, Genome research.

[14]  R. Guigó,et al.  GeneID in Drosophila. , 2000, Genome research.

[15]  R. Guigó,et al.  SGP-1: prediction and validation of homologous genes based on sequence alignments. , 2001, Genome research.

[16]  T. Richmond Gene recognition via spliced alignment , 2000, Genome Biology.

[17]  Ian Korf,et al.  Integrating genomic homology into gene structure prediction , 2001, ISMB.

[18]  Gerald M. Rubin,et al.  Computational prediction of gene structure and regulation in the genome of Drosophila melanogaster , 2000 .

[19]  C. Burge,et al.  Computational inference of homologous gene structures in the human genome. , 2001, Genome research.

[20]  C. Burge,et al.  A computational analysis of sequence features involved in recognition of short introns , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[21]  P. Rouzé,et al.  Current methods of gene prediction, their strengths and weaknesses. , 2002, Nucleic acids research.

[22]  Burkhard Morgenstern,et al.  Exon discovery by genomic sequence alignment , 2002, Bioinform..

[23]  S. Salzberg,et al.  Microbial gene identification using interpolated Markov models. , 1998, Nucleic acids research.

[24]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[25]  S. Lewis,et al.  Genome annotation assessment in Drosophila melanogaster. , 2000, Genome research.