Scalable structural clustering of local RNA secondary structures

Here, we propose an alignment-free approach for clustering RNA sequences according to sequence and structure information. We extend a fast graph kernel technique that we developed for chemoinformatics applications and adapt it to detect similarities between RNA secondary structures. The key novelties are twofold: (1) we represent multiple folding hypotheses associated with a single RNA sequence in a flexible graph format; and (2) we efficiently convert the graph encoding into very high-dimensional sparse vectors. The first strategy allows us to compensate for the inaccuracies of the minimum free energy solution. The second strategy allows us to use locality-sensitive hashing methods to identify clusters with a complexity that is linear in the number of sequences N, i.e. avoiding the quadratic complexity arising from pairwise similarity computations. We have integrated the approach into a ready-to-use pipeline for large-scale clustering of putative ncRNAs. The method has been evaluated on known ncRNA classes and compared against existing approaches such as LocARNA and RNASOUP. We show that we not only obtain clusters of high quality but also achieve striking speedups: from years to days for serial computation, down to hours for the parallel implementation. We applied our method to six heterogeneous large-scale data sets containing more than 220,000 sequence fragments in total. We analyzed predicted short ncRNAs that lacked reliable class assignments, and we searched for local structural elements specific to experimentally validated lincRNAs. In the latter case, we found enriched GO terms for lincRNAs containing predicted local motifs, which suggest a connection to vital processes of the human nervous system.
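To illustrate how locality-sensitive hashing can avoid comparing all N(N-1)/2 pairs, the following is a minimal sketch using MinHash with banding. It is not the pipeline described above: plain sequence k-mers stand in for the graph kernel features, and the function names (`kmer_features`, `minhash_signature`, `candidate_pairs`) and parameter values are hypothetical choices made only for this example.

```python
import hashlib
from collections import defaultdict


def kmer_features(seq, k=3):
    """Stand-in sparse feature set: the k-mers of a sequence.

    Assumption: the real method derives features from structure graphs;
    k-mers are used here only to keep the sketch self-contained.
    The sequence must have at least k characters.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, feature):
    """Deterministic 64-bit hash of a feature under a given seed."""
    data = f"{seed}:{feature}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_hashes=24):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(seed, f) for f in features)
                 for seed in range(num_hashes))


def candidate_pairs(signatures, bands=12):
    """Bucket signatures band by band; only items sharing a bucket are paired."""
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[sig[b * rows:(b + 1) * rows]].append(idx)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    pairs.add((members[i], members[j]))
    return pairs


if __name__ == "__main__":
    seqs = ["GGGAAACCCUUU", "GGGAAACCCUUU", "ACACACACACAC"]
    sigs = [minhash_signature(kmer_features(s)) for s in seqs]
    print(sorted(candidate_pairs(sigs)))
```

Computing signatures and filling buckets takes one pass over the sequences, so that part is linear in N. The pair enumeration inside a bucket is quadratic only in the bucket size, which stays small as long as the hash spreads dissimilar items across buckets.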
