Advancing the State of the Art in Computational Gene Prediction

Current methods for computationally predicting the locations and intron-exon structures of protein-coding genes in eukaryotic DNA are largely based on probabilistic, state-based generative models such as hidden Markov models and their various extensions. Unfortunately, little attention has been paid to the optimality of these models for the gene-parsing problem. Furthermore, as the prevalence of alternative splicing in human genes becomes more apparent, the "one gene, one parse" discipline endorsed by virtually all current gene-finding systems becomes less attractive from a biomedical perspective. Because our ability to accurately identify all the isoforms of each gene in the genome is of direct importance to biomedicine, our ability to improve gene-finding accuracy both for human and non-human DNA clearly has a potential to significantly impact human health. In this paper we review current methods and suggest a number of possible directions for further research that may alleviate some of these problems and ultimately lead to better and more useful gene predictions.

[1]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[2]  Yan-Da Li,et al.  Identifying splicing sites in eukaryotic RNA: support vector machine approach , 2003, Comput. Biol. Medicine.

[3]  E. Birney,et al.  EGASP: the human ENCODE Genome Annotation Assessment Project , 2006, Genome Biology.

[4]  Piero Fariselli,et al.  A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins , 2005, BMC Bioinformatics.

[5]  Christopher B. Burge,et al.  Identification and analysis of alternative splicing events conserved in human and mouse Gene , 2005 .

[6]  J. C. Clemens,et al.  Alternative Splicing of Drosophila Dscam Generates Axon Guidance Receptors that Exhibit Isoform-Specific Homophilic Binding , 2004, Cell.

[7]  David Haussler,et al.  Improved splice site detection in Genie , 1997, RECOMB '97.

[8]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[9]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kerneis That Recognize Translation Initialion Sites , 2000, German Conference on Bioinformatics.

[10]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[11]  R. Durbin,et al.  Biological sequence analysis: Background on probability , 1998 .

[12]  Piero Fariselli,et al.  The posterior-Viterbi: a new decoding algorithm for hidden Markov models , 2005 .

[13]  Andrew McCallum,et al.  Gene Prediction with Conditional Random Fields , 2005 .

[14]  Charles E. Chapple,et al.  Diversity and functional plasticity of eukaryotic selenoproteins: identification and characterization of the SelJ family. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[15]  S. Salzberg,et al.  Computational gene finding in plants , 2004, Plant Molecular Biology.

[16]  Ian Korf,et al.  Gene finding in novel genomes , 2004, BMC Bioinformatics.

[17]  Michael Ashburner,et al.  Annotation of the Drosophila melanogaster euchromatic genome: a systematic review , 2002, Genome Biology.

[18]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[19]  Steven Salzberg,et al.  An empirical analysis of training protocols for probabilistic gene finders , 2005, BMC Bioinformatics.

[20]  Michael Q. Zhang,et al.  Computational identification of promoters and first exons in the human genome , 2001, Nature Genetics.

[21]  William H. Majoros,et al.  Efficient implementation of a generalized pair hidden Markov model for comparative gene finding , 2005, Bioinform..

[22]  A. Reymond,et al.  Tandem chimerism as a means to increase protein complexity in the human genome. , 2005, Genome research.

[23]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[24]  David F. Burke,et al.  PROVAT – a versatile tool for Voronoi tessellation analysis of protein structures and complexes , 2005, BMC Bioinformatics.

[25]  S. Salzberg,et al.  Improved microbial gene identification with GLIMMER. , 1999, Nucleic acids research.

[26]  Ron Shamir,et al.  Accurate identification of alternatively spliced exons using support vector machine , 2005, Bioinform..

[27]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[28]  Jotun Hein,et al.  Using hidden Markov models and observed evolution to annotate viral genomes , 2006, Bioinform..

[29]  Jonathan E. Allen,et al.  JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions , 2006, Genome Biology.

[30]  Thangavel Alphonse Thanaraj,et al.  ASD: the Alternative Splicing Database , 2004, Nucleic Acids Res..

[31]  Ian Korf,et al.  MaskerAid : a performance enhancement to RepeatMasker , 2000, Bioinform..

[32]  S. Salzberg,et al.  Interpolated Markov models for eukaryotic gene finding. , 1999, Genomics.

[33]  Kuldip K. Paliwal,et al.  Automatic Speech and Speaker Recognition , 1996 .

[34]  Haixu Tang,et al.  Splicing graphs and EST assembly problem , 2002, ISMB.

[35]  A. Krogh Two methods for improving performance of an HMM application for gene finding , 1997 .

[36]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[37]  David Haussler,et al.  Computational identification of evolutionarily conserved exons , 2004, RECOMB.

[38]  Simon Cawley,et al.  HMM sampling and applications to gene finding and alternative splicing , 2003, ECCB.

[39]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[40]  Michael R. Brent,et al.  Using Multiple Alignments to Improve Gene Prediction , 2005, RECOMB.

[41]  Christopher B. Burge,et al.  Recognition of Unknown Conserved Alternatively Spliced Exons , 2005, PLoS Comput. Biol..

[42]  Yves Normandin Maximum Mutual Information Estimation of Hidden Markov Models , 1996 .

[43]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[44]  Steven Salzberg,et al.  Efficient decoding algorithms for generalized hidden Markov model gene finders , 2005, BMC Bioinformatics.

[45]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[46]  Mario Stanke,et al.  Gene prediction with a hidden Markov model and a new intron submodel , 2003, ECCB.

[47]  Erik L. L. Sonnhammer,et al.  An HMM posterior decoder for sequence feature prediction that includes homology information , 2005, ISMB.

[48]  L. Pachter,et al.  SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. , 2003, Genome research.

[49]  Terrence S. Furey,et al.  The UCSC Genome Browser Database , 2003, Nucleic Acids Res..

[50]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[51]  Felix L. Chernousko,et al.  Finding prokaryotic genes by the 'frame-by-frame' algorithm: targeting gene starts and overlapping genes , 1999, Bioinform..

[52]  Gene W. Yeo,et al.  Systematic Identification and Analysis of Exonic Splicing Silencers , 2004, Cell.

[53]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[54]  Gunnar Rätsch,et al.  RASE: recognition of alternatively spliced exons in C.elegans , 2005, ISMB.

[55]  Günther Ruske,et al.  Discriminative training for continuous speech recognition , 1995, EUROSPEECH.