LongSAGE analysis significantly improves genome annotation: identifications of novel genes and alternative transcripts in the mouse

MOTIVATION: Owing to their increased length, LongSAGE tags are expected to be more reliable in direct assignment to genome sequences. We therefore evaluated the use of LongSAGE data in genome annotation, using our LongSAGE dataset of 202 015 tags (41 718 unique tags) experimentally generated from mouse embryonic tail libraries.

RESULTS: A fraction of LongSAGE tags could not be unambiguously assigned to a single gene, owing to the presence of widely conserved sequences downstream of particular CATG anchor sites. Alternative transcript forms were confirmed for 45% of all detected genes. Surprisingly, a large fraction of the LongSAGE tags with hits to the genome (66%) could not be assigned to any gene annotated in EnsEMBL. Among these, 2098 LongSAGE tags fell into regions containing putative genes predicted by GenScan, providing experimental evidence that these are real genes, while 9112 genes were found to have been omitted or wrongly annotated by the EnsEMBL pipeline.

CONCLUSIONS: LongSAGE transcriptome data can significantly improve genome annotation by identifying novel genes and alternative transcripts, even for the best-characterized organisms to date, such as the mouse.

CONTACT: imai@gsf.de
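The tag ambiguity described in the results can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a 21-base LongSAGE tag (the CATG anchor plus 17 downstream bases), takes the tag at the 3'-most CATG of each transcript, and uses made-up toy sequences. The function names `three_prime_tag`, `tag_to_genes` and `ambiguous_tags` and the gene labels are hypothetical.

```python
from collections import defaultdict

ANCHOR = "CATG"  # anchor site named in the abstract
TAG_LEN = 21     # assumption: CATG plus 17 downstream bases


def three_prime_tag(transcript, tag_len=TAG_LEN):
    """Return the tag starting at the 3'-most CATG, or None if no full tag fits."""
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)
    if pos == -1 or pos + tag_len > len(seq):
        return None
    return seq[pos:pos + tag_len]


def tag_to_genes(transcripts):
    """Map each tag to the set of gene names whose transcript yields it."""
    index = defaultdict(set)
    for gene, seq in transcripts.items():
        tag = three_prime_tag(seq)
        if tag is not None:
            index[tag].add(gene)
    return index


def ambiguous_tags(index):
    """Keep only tags that point to more than one gene."""
    return {tag: genes for tag, genes in index.items() if len(genes) > 1}


# Toy transcripts (invented): geneA and geneB share the sequence
# downstream of their last CATG, so their tags are identical.
shared = "CATG" + "ACGTACGTACGTACGTA"
toy = {
    "geneA": "TTTT" + shared + "AAAA",
    "geneB": "GGGG" + shared + "CCCC",
    "geneC": "CATG" + "T" * 17 + "AA",
}
ambiguous = ambiguous_tags(tag_to_genes(toy))
```

In this toy set, `ambiguous` holds the one shared tag mapped to both `geneA` and `geneB`, which mirrors how a conserved stretch downstream of a CATG site prevents assignment to a single gene.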
