Unexpected observations after mapping LongSAGE tags to the human genome

Background
SAGE has been used widely to study the expression of known transcripts, but much less to annotate new transcribed regions. LongSAGE produces tags that are sufficiently long to be reliably mapped to a whole-genome sequence. Here we used this property to study the position of human LongSAGE tags obtained from all public libraries. We focused mainly on tags that do not map to known transcripts.

Results
Using a published error rate in SAGE libraries, we first removed the tags likely to result from sequencing errors. We then observed that an unexpectedly large number of the remaining tags still did not match the genome sequence. Some of these correspond to parts of human mRNAs, such as polyA tails, junctions between two exons and polymorphic regions of transcripts. Another non-negligible proportion can be attributed to contamination by murine transcripts and to residual sequencing errors. After applying these screens to ensure that our dataset is highly reliable, we studied the tags that map once to the genome. Of these tags, 31% correspond to unannotated transcripts. The others map to known transcribed regions, but nearly half of them are located either in antisense orientation or in new variants of these known transcripts.

Conclusion
We performed a comprehensive study of all publicly available human LongSAGE tags, and carefully verified the reliability of these data. We identified the likely origin of many tags that did not match the human genome sequence. The properties of the remaining tags imply that the level of sequencing error may have been underestimated. The frequency of tags that match the genome sequence once, but not in an annotated exon, suggests that the human transcriptome is much more complex than shown by the current human genome annotations, with many new splicing variants and antisense transcripts. SAGE data are appropriate for mapping new transcripts to the genome, as demonstrated by the high rate of cross-validation of the corresponding tags using other methods.
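The distinction drawn above between tags that do not match the genome, tags that map once, and tags that map several times can be illustrated with a minimal sketch. It assumes exact string matching on both strands of a toy sequence; the abstract does not state which mapping tool or parameters were used, so the function names (`reverse_complement`, `count_occurrences`, `classify_tag`) and the sequence `TOY_GENOME` are hypothetical and are not the authors' pipeline.

```python
# Toy illustration only: exact matching on a short made-up sequence.
# All names and data here are assumptions, not taken from the paper.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

TOY_GENOME = "CATGAAACCCGGGTTTAGCATGAAACTT"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_occurrences(genome, tag):
    """Count exact occurrences of tag in genome, including overlapping ones."""
    count = 0
    start = genome.find(tag)
    while start != -1:
        count += 1
        start = genome.find(tag, start + 1)
    return count


def classify_tag(genome, tag):
    """Classify a tag as 'no_match', 'unique' or 'multiple'.

    Both strands are searched; a tag equal to its own reverse
    complement is counted only once per position.
    """
    hits = count_occurrences(genome, tag)
    rc = reverse_complement(tag)
    if rc != tag:
        hits += count_occurrences(genome, rc)
    if hits == 0:
        return "no_match"
    if hits == 1:
        return "unique"
    return "multiple"


if __name__ == "__main__":
    for tag in ["CATGAAACC", "CATGAAAC", "CATGTTTT", "TCATGCTAA"]:
        print(tag, classify_tag(TOY_GENOME, tag))
```

In a real analysis, the tags classified as unique would then be compared with annotated exons to separate known transcripts from candidate new or antisense ones; that step depends on an annotation source and is left out of this sketch.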
