Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs

RNA-Seq provides an unbiased way to study a transcriptome, including both coding and non-coding genes. To date, most RNA-Seq studies have critically depended on existing annotations, and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We apply it to mouse embryonic stem cells, neuronal precursor cells, and lung fibroblasts to accurately reconstruct the full-length gene structures for the vast majority of known expressed genes. We identify substantial variation in protein-coding genes, including thousands of novel 5′-start sites, 3′-ends, and internal coding exons. We then determine the gene structures of over a thousand lincRNA and antisense loci. Our results open the way to direct experimental manipulation of thousands of non-coding RNAs, and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes.

[1]  Carolyn J. Brown,et al.  A gene from the region of the human X inactivation centre is expressed exclusively from the inactive X chromosome , 1991, Nature.

[2]  Paul Schliekelman,et al.  Statistical Methods in Bioinformatics: An Introduction , 2001 .

[3]  Thomas E. Royce,et al.  Global Identification of Human Transcribed Sequences with Genome Tiling Arrays , 2004, Science.

[4]  S. Salzberg,et al.  The Transcriptional Landscape of the Mammalian Genome , 2005, Science.

[5]  Austin G Smith,et al.  Niche-Independent Symmetrical Self-Renewal of a Mammalian Tissue Stem Cell , 2005, PLoS biology.

[6]  S. Batalov,et al.  A Strategy for Probing the Function of Noncoding RNAs Finds a Repressor of NFAT , 2005, Science.

[7]  S. Batalov,et al.  Antisense Transcription in the Mammalian Transcriptome , 2005, Science.

[8]  T. Mikkelsen,et al.  Genome-wide maps of chromatin state in pluripotent and lineage-committed cells , 2007, Nature.

[9]  P. Stadler,et al.  RNA Maps Reveal New RNA Classes and a Possible Function for Pervasive Transcription , 2007, Science.

[10]  A. J. Schroeder,et al.  Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. , 2007, Genome research.

[11]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[12]  Howard Y. Chang,et al.  Functional Demarcation of Active and Silent Chromatin Domains in Human HOX Loci by Noncoding RNAs , 2007, Cell.

[13]  S. Ranade,et al.  Stem cell transcriptome profiling via massive-scale mRNA sequencing , 2008, Nature Methods.

[14]  Eric T. Wang,et al.  Alternative Isoform Regulation in Human Tissue Transcriptomes , 2008, Nature.

[15]  R. Lister,et al.  Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis , 2008, Cell.

[16]  F. Denoeud,et al.  Annotating genomes with massive-scale RNA sequencing , 2008, Genome Biology.

[17]  B. Frey,et al.  Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing , 2008, Nature Genetics.

[18]  Manolis Kellis,et al.  Performance and Scalability of Discriminative Metrics for Comparative Gene Identification in 12 Drosophila Genomes , 2008, PLoS Comput. Biol..

[19]  Jeannie T. Lee,et al.  Polycomb Proteins Targeted by a Short Repeat RNA to the Mouse X Chromosome , 2008, Science.

[20]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[21]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[22]  Eric T. Wang,et al.  An Abundance of Ubiquitously Expressed Genes Revealed by Tissue Transcriptome Sequence Data , 2009, PLoS Comput. Biol..

[23]  Xiaohui Xie,et al.  Identifying novel constrained elements by exploiting biased substitution patterns , 2009, Bioinform..

[24]  Lee T. Sam,et al.  Transcriptome Sequencing to Detect Gene Fusions in Cancer , 2009, Nature.

[25]  J. Rinn,et al.  Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression , 2009, Proceedings of the National Academy of Sciences.

[26]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[27]  Hunter B. Fraser,et al.  Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing , 2009, Proceedings of the National Academy of Sciences.

[28]  Inanç Birol,et al.  De novo transcriptome assembly with ABySS , 2009, Bioinform..

[29]  Michael F. Lin,et al.  Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals , 2009, Nature.

[30]  J. Maguire,et al.  Integrative analysis of the melanoma transcriptome. , 2010, Genome research.

[31]  M. Gerstein,et al.  Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing , 2010, Proceedings of the National Academy of Sciences.