Current methods of gene prediction, their strengths and weaknesses.

While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. Many prediction programs have been developed to address one part of this problem: locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described, and the resulting software is classified according to both the method and the type of evidence used. Finally, the difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
