GraphClust: alignment-free structural clustering of local RNA secondary structures

Motivation: Clustering according to sequence–structure similarity has become a generally accepted scheme for ncRNA annotation. Applying it to complete genomic sequences and whole transcriptomes is therefore desirable, but is hindered by extremely high computational costs.

Results: We present a novel linear-time, alignment-free method for comparing and clustering RNAs according to sequence and structure. The approach scales to datasets of hundreds of thousands of sequences. The quality of the retrieved clusters has been benchmarked against known ncRNA datasets and is comparable to that of state-of-the-art sequence–structure methods, while achieving speedups of several orders of magnitude. A selection of applications aimed at detecting novel structural ncRNAs is presented. As an example, we predicted local structural elements specific to lincRNAs that likely associate the involved transcripts functionally with vital processes of the human nervous system. In total, we predicted 349 local structural RNA elements.

Availability: The GraphClust pipeline is available on request.

Contact: backofen@informatik.uni-freiburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
