Using ESTs to improve the accuracy of de novo gene prediction

Background: ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction.

Results: TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome, TWINSCAN_EST shows a 14% improvement in sensitivity and a 13% improvement in specificity in predicting exact gene structures, compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these structures also constrain the predictions in regions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and a 1% improvement in specificity of exact gene structure predictions compared to N-SCAN.

Conclusion: TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package: http://genes.cse.wustl.edu/distribution/download_TS.html.
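The accuracy figures above are stated in terms of exact gene and exact exon prediction. As an illustration of how such measures are commonly computed, the sketch below counts a predicted gene as correct only if its full set of CDS exons matches an annotated gene exactly. It assumes the convention usual in gene-finding evaluations, where sensitivity is the fraction of annotated structures predicted exactly and "specificity" is the fraction of predictions that are exactly right. The data layout, function names, and toy coordinates are hypothetical and are not taken from the TWINSCAN package or from the evaluation reported here.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]                    # (start, end) of one CDS exon
Gene = Tuple[str, str, Tuple[Exon, ...]]  # (chromosome, strand, ordered CDS exons)


def exact_match_accuracy(predicted: Iterable, annotated: Iterable) -> Tuple[float, float]:
    """Return (sensitivity, specificity) under exact matching.

    sensitivity = exact matches / number of annotated items
    specificity = exact matches / number of predicted items
    (assumed gene-finding convention; elsewhere this ratio is called precision)
    """
    pred, ann = set(predicted), set(annotated)
    true_positives = len(pred & ann)
    sensitivity = true_positives / len(ann) if ann else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


def exons_of(genes: Iterable[Gene]) -> Set[Tuple[str, str, Exon]]:
    """Flatten genes into a set of (chromosome, strand, exon) items."""
    return {(chrom, strand, exon) for chrom, strand, exons in genes for exon in exons}


# Toy data (hypothetical coordinates).
annotated_genes = [
    ("chrI", "+", ((100, 200), (300, 400))),
    ("chrI", "-", ((1000, 1100), (1200, 1300))),
]
predicted_genes = [
    ("chrI", "+", ((100, 200), (300, 400))),      # exact match
    ("chrI", "-", ((1000, 1100), (1200, 1350))),  # one exon boundary is off
    ("chrI", "+", ((2000, 2100),)),               # no annotated counterpart
]

gene_sn, gene_sp = exact_match_accuracy(predicted_genes, annotated_genes)
exon_sn, exon_sp = exact_match_accuracy(exons_of(predicted_genes), exons_of(annotated_genes))
```

In this toy case a single misplaced exon boundary makes the whole second gene count as wrong at the gene level, while three of its neighbours' exons still count as correct at the exon level. This is why exact gene accuracy is the stricter of the two measures.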
