Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data.

We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid level. Using the incRNA model and data from the modENCODE consortium, we are able to separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy (97% AUC on an independent validation set), and find more than 7000 novel ncRNA candidates, among which more than 1000 are located in the intergenic regions of C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR, detecting the expression for 14. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.

[1]  Raymond K. Auerbach,et al.  Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project , 2010, Science.

[2]  M. Gerstein,et al.  Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays , 2010, BMC Genomics.

[3]  Raymond K. Auerbach,et al.  Genome-Wide Identification of Binding Sites Defines Distinct Functions for Caenorhabditis elegans PHA-4/FOXA in Development and Environmental Response , 2010, PLoS genetics.

[4]  John S Mattick,et al.  Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data , 2010, BMC Genomics.

[5]  Brian Luke,et al.  TERRA: telomeric repeat-containing RNA , 2009, The EMBO journal.

[6]  M. Gerstein,et al.  Unlocking the secrets of the genome , 2009, Nature.

[7]  Zachary Pincus,et al.  Dynamic expression of small non-coding RNAs, including novel microRNAs and piRNAs/21U-RNAs, during Caenorhabditis elegans development , 2009, Genome Biology.

[8]  Sean R. Eddy,et al.  Infernal 1.0: inference of RNA alignments , 2009, Bioinform..

[9]  R. Sachidanandam,et al.  Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs , 2009, Nature.

[10]  Michael F. Lin,et al.  Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals , 2009, Nature.

[11]  Patricia P. Chan,et al.  GtRNAdb: a database of transfer RNA genes detected in genomic sequence , 2008, Nucleic Acids Res..

[12]  Robert D. Finn,et al.  Rfam: updates to the RNA families database , 2008, Nucleic Acids Res..

[13]  W. L. Ruzzo,et al.  Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions. , 2008, Genome research.

[14]  S. Sunkin,et al.  Specific expression of long noncoding RNAs in the mouse brain , 2008, Proceedings of the National Academy of Sciences.

[15]  W. Theurkauf,et al.  Biogenesis and germline functions of piRNAs , 2007, Development.

[16]  Stijn van Dongen,et al.  miRBase: tools for microRNA genomics , 2007, Nucleic Acids Res..

[17]  Wei Zhou,et al.  Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. , 2007, Genome research.

[18]  Jan Gorodkin,et al.  Fast Pairwise Structural RNA Alignments by Pruning of the Dynamical Programming Matrix , 2007, PLoS Comput. Biol..

[19]  M. Gerstein,et al.  Structured Rnas in the Encode Selected Regions of the Human Genome , 2022 .

[20]  Petrus Tang,et al.  Intronic microRNA: discovery and biological implications. , 2007, DNA and cell biology.

[21]  Gaurav Sharma,et al.  Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign , 2007, BMC Bioinformatics.

[22]  P. Stadler,et al.  Prediction of structured non-coding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. , 2006, Journal of experimental zoology. Part B, Molecular and developmental evolution.

[23]  J. Gorodkin,et al.  Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. , 2006, Genome research.

[24]  David H. Mathews,et al.  Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change , 2006, BMC Bioinformatics.

[25]  Laurent Lestrade,et al.  snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs , 2005, Nucleic Acids Res..

[26]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[27]  J. Shendure,et al.  Materials and Methods Som Text Figs. S1 and S2 Tables S1 to S4 References Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome , 2022 .

[28]  Karin Kiontke,et al.  The phylogenetic relationships of Caenorhabditis and other rhabditids. , 2005, WormBook : the online review of C. elegans biology.

[29]  Peter Schattner,et al.  The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs , 2005, Nucleic Acids Res..

[30]  Peter F Stadler,et al.  Fast and reliable prediction of noncoding RNAs , 2005, Proc. Natl. Acad. Sci. USA.

[31]  B. Berger,et al.  MSARI: multiple sequence alignments for statistical detection of RNA secondary structure. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[32]  D. Bartel MicroRNAs Genomics, Biogenesis, Mechanism, and Function , 2004, Cell.

[33]  Diego di Bernardo,et al.  ddbRNA: detection of conserved secondary structures in multiple alignments , 2003, Bioinform..

[34]  C. Burge,et al.  Widespread selection for local RNA secondary structure in coding regions of bacterial genes. , 2003, Genome research.

[35]  Elena Rivas,et al.  Noncoding RNA gene detection using comparative sequence analysis , 2001, BMC Bioinformatics.

[36]  S. Eddy,et al.  tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. , 1997, Nucleic acids research.

[37]  W. L. Ruzzo,et al.  De novo prediction of structured RNAs from genomic sequences. , 2010, Trends in biotechnology.

[38]  Raymond K. Auerbach,et al.  PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls , 2009, Nature Biotechnology.

[39]  David Haussler,et al.  Identification and Classification of Conserved RNA Secondary Structures in the Human Genome , 2006, PLoS Comput. Biol..