Refined Annotation of the Arabidopsis Genome by Complete Expressed Sequence Tag Mapping

Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.
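As a rough illustration of how genome-mapped ESTs can yield clusters, the sketch below groups alignments that overlap on the same chromosome and strand. It is a minimal example under stated assumptions, not the procedure used in the study: the `Alignment` record, its fields, and the `cluster_by_locus` function are hypothetical names, and the simple overlap rule stands in for whatever locus definition the authors applied.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one EST aligned to the genome."""
    est_id: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive genomic start of the alignment
    end: int    # 1-based, inclusive genomic end of the alignment


def cluster_by_locus(alignments):
    """Group EST ids whose genomic spans overlap on the same chromosome and strand.

    Assumption: two ESTs belong to the same cluster if their alignment spans
    overlap, directly or through a chain of overlapping ESTs.
    """
    by_seq = defaultdict(list)
    for a in alignments:
        by_seq[(a.chrom, a.strand)].append(a)

    clusters = []
    for key in sorted(by_seq):
        current, current_end = [], None
        # Sweep left to right; start a new cluster when a gap appears.
        for a in sorted(by_seq[key], key=lambda x: (x.start, x.end)):
            if current and a.start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(a.est_id)
            current_end = a.end if current_end is None else max(current_end, a.end)
        if current:
            clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    demo = [
        Alignment("est1", "chr1", "+", 100, 500),
        Alignment("est2", "chr1", "+", 450, 900),
        Alignment("est3", "chr1", "+", 2000, 2500),
        Alignment("est4", "chr1", "-", 120, 480),
    ]
    for cluster in cluster_by_locus(demo):
        print(cluster)
```

Clusters derived this way from genomic coordinates could serve as a reference set when evaluating programs that cluster ESTs by sequence similarity alone.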
