Computational gene prediction using multiple sources of evidence.

This article describes a computational method for constructing gene models from evidence generated by a diverse set of sources, including those typical of a genome annotation pipeline. The program, called Combiner, takes as input a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence. Three different algorithms for combining evidence in the Combiner were implemented and tested on 1783 confirmed genes in Arabidopsis thaliana. Our results show that combining gene prediction evidence consistently outperforms even the best individual gene finder and, in some cases, can produce dramatic improvements in sensitivity and specificity.
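
The abstract does not spell out the three combination algorithms, so the following is only a minimal sketch of one simple way such evidence could be merged: a per-base weighted vote. It is not presented as the Combiner's actual method. The function name `combine_by_weighted_vote`, the source names, the weights, and the threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Half-open interval [start, end) in genomic sequence coordinates.
Interval = Tuple[int, int]


def combine_by_weighted_vote(
    seq_len: int,
    evidence: Dict[str, List[Interval]],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> List[Interval]:
    """Illustrative weighted vote over evidence sources (an assumption,
    not the published algorithm).

    Each source votes "coding" for every base its intervals cover. A base is
    kept when the weighted fraction of supporting sources reaches `threshold`.
    """
    total_weight = sum(weights[name] for name in evidence)
    if total_weight <= 0:
        raise ValueError("weights of the supplied sources must sum to > 0")

    score = [0.0] * seq_len
    for name, intervals in evidence.items():
        # Collect covered positions in a set so that overlapping intervals
        # from the same source are counted only once per base.
        covered = set()
        for start, end in intervals:
            covered.update(range(max(0, start), min(seq_len, end)))
        for pos in covered:
            score[pos] += weights[name]

    # Merge consecutive bases that pass the threshold into intervals.
    combined: List[Interval] = []
    run_start = None
    for pos in range(seq_len):
        is_coding = score[pos] / total_weight >= threshold
        if is_coding and run_start is None:
            run_start = pos
        elif not is_coding and run_start is not None:
            combined.append((run_start, pos))
            run_start = None
    if run_start is not None:
        combined.append((run_start, seq_len))
    return combined


# Hypothetical evidence: two ab initio gene finders and one cDNA alignment.
example_evidence = {
    "gene_finder_a": [(10, 40)],
    "gene_finder_b": [(15, 45)],
    "cdna_alignment": [(15, 40)],
}
# Hypothetical weights: the cDNA alignment is trusted twice as much.
example_weights = {"gene_finder_a": 1.0, "gene_finder_b": 1.0, "cdna_alignment": 2.0}
```

With these example inputs and the default threshold, only the region supported by all three sources, `[15, 40)`, passes the vote. Lowering the threshold to 0.25 admits bases supported by a single gene finder. A real combiner would also need to respect reading frame and splice site consistency, which this sketch ignores.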
