Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus

MOTIVATION: Accurate gene structure annotation is a challenging computational problem in genomics. The best results are achieved with spliced alignment of full-length cDNAs or multiple expressed sequence tags (ESTs) with sufficient overlap to cover the entire gene. For most species, cDNA and EST collections are far from comprehensive. We sought to overcome this bottleneck by exploring the possibility of using combined EST resources from fairly diverged species that still share a common gene space. Previous spliced alignment tools were found inadequate for this task because they rely on very high sequence similarity between the ESTs and the genomic DNA.

RESULTS: We have developed a computer program, GeneSeqer, which is capable of aligning thousands of ESTs with a long genomic sequence in a reasonable amount of time. The algorithm is uniquely designed to tolerate a high percentage of mismatches and insertions or deletions in the EST relative to the genomic template. This feature allows use of non-cognate ESTs for gene structure prediction, including ESTs derived from duplicated genes and homologous genes from related species. The increased gene prediction sensitivity results in part from novel splice site prediction models that are also available as a stand-alone splice site prediction tool. We assessed GeneSeqer performance relative to a standard Arabidopsis thaliana gene set and demonstrate its utility for plant genome annotation. In particular, we propose that this method provides a timely tool for the annotation of the rice genome, using abundant ESTs from other cereals and plants.

AVAILABILITY: The source code is available for download at http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html. Web servers for Arabidopsis and other plant species are accessible at http://www.plantgdb.org/cgi-bin/AtGeneSeqer.cgi and http://www.plantgdb.org/cgi-bin/GeneSeqer.cgi, respectively. For non-plant species, use http://bioinformatics.iastate.edu/cgi-bin/gs.cgi. The splice site prediction tool (SplicePredictor) is distributed with the GeneSeqer code. A SplicePredictor web server is available at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
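
To illustrate what spliced alignment of an EST against a genomic template involves, the following is a minimal sketch only. It is not the GeneSeqer algorithm: GeneSeqer scores splice sites with statistical models, whereas this toy dynamic program simply permits an intron wherever the genomic sequence shows a GT...AG pair. The function name `spliced_align`, the scoring values, and the `min_intron` parameter are all assumptions chosen for illustration.

```python
def spliced_align(genome, est, match=2, mismatch=-1, gap=-2,
                  intron=-5, min_intron=4):
    """Toy spliced alignment (hypothetical sketch, not GeneSeqer).

    Aligns all of `est` to a region of `genome`, tolerating mismatches and
    indels, and allowing introns only between a GT donor and an AG acceptor.
    Returns (score, exons) with exons as half-open genomic intervals.
    """
    n, m = len(genome), len(est)
    NEG = float("-inf")
    # score[i][j]: best score for est[:j] with the alignment ending, in an
    # exon, right after genome[:i]. Leading genomic sequence is free.
    score = [[0] + [NEG] * m for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    # best_donor[j]: best (score, position) over donor sites seen so far
    # that lie at least min_intron bases upstream of the current row.
    best_donor = [(NEG, -1)] * (m + 1)

    for i in range(1, n + 1):
        d = i - min_intron
        if d >= 1 and genome[d:d + 2] == "GT":
            for j in range(1, m + 1):
                if score[d][j] > best_donor[j][0]:
                    best_donor[j] = (score[d][j], d)
        is_acceptor = i >= 2 and genome[i - 2:i] == "AG"
        for j in range(1, m + 1):
            s = match if genome[i - 1] == est[j - 1] else mismatch
            cands = [
                (score[i - 1][j - 1] + s, (i - 1, j - 1, "align")),
                (score[i - 1][j] + gap, (i - 1, j, "align")),  # skip genomic base
                (score[i][j - 1] + gap, (i, j - 1, "align")),  # skip EST base
            ]
            if is_acceptor and best_donor[j][1] >= 0:
                # Close an intron spanning genome[donor:i].
                cands.append((best_donor[j][0] + intron,
                              (best_donor[j][1], j, "intron")))
            score[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    # Trailing genomic sequence is free: pick the best end position.
    end = max(range(n + 1), key=lambda i: score[i][m])
    exons, i, j, exon_end = [], end, m, end
    while j > 0:
        pi, pj, kind = back[i][j]
        if kind == "intron":
            exons.append((i, exon_end))
            exon_end = pi
        i, j = pi, pj
    exons.append((i, exon_end))
    return score[end][m], exons[::-1]


# Made-up example: two exons separated by a GT...AG intron, with flanks.
exon1, exon2 = "ATGGCCAAG", "GACCTGAAA"
genome = "TTTT" + exon1 + "GTAAGT" + "CTCTCTCTC" + "AG" + exon2 + "TTTT"
est = exon1 + exon2
score, exons = spliced_align(genome, est)
print(score, exons)
```

Because mismatches and gaps carry finite penalties rather than being forbidden, an EST that differs from the genomic template at some positions can still be placed across the intron, which is the property the abstract emphasizes for non-cognate ESTs.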
