Gene Prediction Methods

Most computational gene-finding methods in current use are derived from the fields of natural language processing and speech recognition. These fields are concerned with parsing spoken or written language into functional components such as nouns, verbs, and phrases of various types. The parsing task is governed by a set of syntax rules that dictate which linguistic elements may immediately follow each other in well-formed sentences; for example, $\text{subject} \rightarrow \text{verb}$, $\text{verb} \rightarrow \text{direct object}$, and so on. Gene-finding resembles linguistic parsing in that we wish to partition a sequence of letters into elements of biological relevance, such as exons, introns, and the intergenic regions separating genes. That is, we wish not only to find the genes, but also to predict their internal exon-intron structure so that the encoded protein(s) may be deduced. Figure 5.1 illustrates this internal structure for a typical gene.
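The parsing analogy can be made concrete with a small sketch. The Python below is an illustration only, not a method from the chapter: the one-letter labels (`N`, `E`, `I`), the `ALLOWED_NEXT` table, and the function names are all hypothetical, and the rules are deliberately simplified (real gene finders distinguish initial, internal, and terminal exons, reading frame, and strand). It collapses a per-base labelling into segments and checks that each segment is followed only by an element the syntax rules permit.

```python
from itertools import groupby

# Hypothetical one-letter labels for each base of a sequence.
LABEL_NAMES = {"N": "intergenic", "E": "exon", "I": "intron"}

# Simplified, assumed syntax rules: which element may immediately follow which.
ALLOWED_NEXT = {
    "intergenic": {"exon"},
    "exon": {"intron", "intergenic"},
    "intron": {"exon"},
}


def segments(labels):
    """Collapse a per-base label string into (element, start, end) segments.

    Coordinates are 0-based and half-open, so `end` is one past the last base.
    """
    result = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        result.append((LABEL_NAMES[label], position, position + length))
        position += length
    return result


def is_well_formed(elements):
    """Return True if every adjacent pair of elements obeys ALLOWED_NEXT."""
    return all(
        following in ALLOWED_NEXT.get(current, set())
        for current, following in zip(elements, elements[1:])
    )


if __name__ == "__main__":
    parse = segments("NNEEEIIIIEENN")
    for element, start, end in parse:
        print(f"{element}\t{start}\t{end}")
    print(is_well_formed([element for element, _, _ in parse]))
```

A labelling such as `NNIIEE` would be rejected under these assumed rules, because an intron may not directly follow an intergenic region.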
