Evaluation of gene-finding programs on mammalian sequences.

We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs is substantially more accurate than the programs analyzed in previous studies. We also examined the accuracy of the programs as a function of various sequence and prediction features, such as the G + C content of the sequence, the length and type of exons, the signal type, and the score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) and the tables with the complete results are available at http://www.cs.ubc.ca/~rogic/evaluation/.
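As a rough illustration of the kind of measurement described above, the following sketch computes the G + C content of a sequence and an exon-level sensitivity and specificity for one set of predictions. It is a minimal sketch under stated assumptions, not the procedure used in this study: the function names, the (start, end) exon representation, and the rule that an exon counts as correct only when both boundaries match exactly are all assumptions made here for illustration. "Specificity" follows the convention common in gene-finding evaluations, namely the fraction of predicted exons that are correct.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int]  # hypothetical representation: (start, end), inclusive


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    return gc / total if total else 0.0


def exon_accuracy(actual: Iterable[Exon],
                  predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Assumption: a predicted exon is correct only if both of its
    boundaries coincide exactly with those of an annotated exon.
    """
    actual_set, predicted_set = set(actual), set(predicted)
    correct = len(actual_set & predicted_set)
    sensitivity = correct / len(actual_set) if actual_set else 0.0
    specificity = correct / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity
```

Grouping sequences by their `gc_content` value and averaging the `exon_accuracy` results within each group would give accuracy as a function of G + C content; the bin boundaries would be a further choice not specified here.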
