Using Multiple Alignments to Improve Gene Prediction

The multiple-species de novo gene prediction problem can be stated as follows: given an alignment of genomic sequences from two or more organisms, predict the location and structure of all protein-coding genes in one or more of the sequences. Here, we present a new system, N-SCAN (a.k.a. TWINSCAN 3.0), for addressing this problem. N-SCAN can model dependencies between the aligned sequences, context-dependent substitution rates, and insertions and deletions in the sequences. We implemented N-SCAN and used it to generate predictions for the entire human genome. An analysis of these predictions shows that N-SCAN's predictive accuracy in human exceeds that of all previously published whole-genome de novo gene predictors. In addition, we generated predictions for the genome of the fruit fly Drosophila melanogaster to demonstrate the applicability of N-SCAN to invertebrate gene prediction.
