Vertebrate gene finding from multiple-species alignments using a two-level strategy

Background: The accuracy of gene structure prediction in vertebrate DNA sequences can be improved by analyzing alignments of multiple related species, since functional regions of genes tend to be more conserved than non-functional sequence.

Results: We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features using sequence alignments between multiple vertebrate species. The structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure; it also identifies possible additional exons and assigns confidence scores to them. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high-confidence new coding exons that are not in the Ensembl gene set.

Conclusion: We present a practical multiple-species gene prediction method. Accuracy improves as additional species are introduced, up to at least eight. The novel predictions of the whole-genome scan should support efficient experimental verification.
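The two-level strategy described in the abstract can be illustrated with a toy sketch. This is not the DOGFISH implementation: the site scores, the exon score (a plain sum of acceptor and donor scores), the `max_len` and `min_intron` parameters, and all function names are assumptions made for illustration only. Level 1 is represented by precomputed per-position scores standing in for a classifier's output, and level 2 is a simple dynamic program that chains non-overlapping candidate exons.

```python
from typing import Dict, List, Tuple

ScoredExon = Tuple[int, int, float]  # (acceptor position, donor position, score)


def candidate_exons(acceptors: Dict[int, float],
                    donors: Dict[int, float],
                    max_len: int = 300) -> List[ScoredExon]:
    """Pair each acceptor with each downstream donor within max_len bases.

    The exon score is the sum of the two site scores (a toy assumption).
    The result is sorted by donor position so the chaining step can scan
    left to right.
    """
    exons = []
    for a_pos, a_score in acceptors.items():
        for d_pos, d_score in donors.items():
            if 0 < d_pos - a_pos <= max_len:
                exons.append((a_pos, d_pos, a_score + d_score))
    return sorted(exons, key=lambda e: e[1])


def best_structure(exons: List[ScoredExon],
                   min_intron: int = 30) -> Tuple[List[Tuple[int, int]], float]:
    """Pick the highest-scoring chain of exons separated by >= min_intron bases."""
    n = len(exons)
    if n == 0:
        return [], 0.0
    best = [0.0] * n   # best[i]: best chain score ending with exon i
    prev = [-1] * n    # back-pointer to the previous exon in that chain
    for i, (start, _end, score) in enumerate(exons):
        best[i] = score
        for j in range(i):
            if exons[j][1] + min_intron <= start and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    total = best[i]
    chain = []
    while i != -1:
        chain.append(exons[i][:2])
        i = prev[i]
    return chain[::-1], total


# Hypothetical site scores, standing in for the output of a site classifier.
acceptors = {100: 2.0, 400: 1.5, 150: -1.0}
donors = {200: 1.0, 520: 2.5, 260: -0.5}
chain, total = best_structure(candidate_exons(acceptors, donors))
print(chain, total)
```

Keeping the two levels separate means the scoring step can be retrained or replaced (for example, when more aligned species become available) without changing the chaining step, which only consumes scores.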
