ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs

Background: Identifying sequence-structure motifs common to two RNAs can substantially speed up the comparison of structural RNAs. The core algorithm of the existing approach ExpaRNA solves this problem for input structures that are known a priori. However, such structures are rarely known; moreover, predicting them computationally is no remedy, since single-sequence structure prediction is highly unreliable.

Results: The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in the entire Boltzmann-distributed structure ensembles of two RNAs; it thereby matches and folds RNAs simultaneously, analogous to the well-known “simultaneous alignment and folding” of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P’s sequence-structure motifs. We use the best chain as anchor constraints for the sequence-structure alignment tool LocARNA, which results in the very fast RNA alignment approach ExpLoc-P. ExpLoc-P is benchmarked in several variants and against state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up of over 30-fold for the longest benchmark sequences (≈400 nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.uni-freiburg.de/Software/ExpaRNA-P.

Conclusions: ExpaRNA-P’s novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications.
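
To illustrate the idea of selecting a compatible subset of matches by chaining, the following is a minimal sketch only, not the generalized chaining algorithm of ExpaRNA-P. It assumes a strongly simplified setting in which each match is a contiguous interval in both sequences with a score, and two matches are compatible when one ends strictly before the other starts in both sequences. The names `Match` and `best_chain` are hypothetical and chosen for this example; real sequence-structure motifs may be non-contiguous and nested, which this sketch does not handle.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    """Hypothetical simplified match: inclusive intervals in sequences A and B."""
    start_a: int
    end_a: int
    start_b: int
    end_b: int
    score: float


def best_chain(matches: List[Match]) -> Tuple[float, List[Match]]:
    """Return the maximum-score chain of pairwise compatible matches.

    Assumption: match i may precede match j only if it ends strictly
    before j starts in BOTH sequences (colinear, non-overlapping).
    Simple O(n^2) dynamic programming over matches sorted by end position.
    """
    if not matches:
        return 0.0, []

    # Any valid predecessor ends earlier in A, so it sorts before its successor.
    order = sorted(matches, key=lambda m: (m.end_a, m.end_b))
    n = len(order)
    best = [m.score for m in order]  # best chain score ending at match j
    prev = [-1] * n                  # back-pointer for traceback

    for j in range(n):
        for i in range(j):
            compatible = (order[i].end_a < order[j].start_a
                          and order[i].end_b < order[j].start_b)
            if compatible and best[i] + order[j].score > best[j]:
                best[j] = best[i] + order[j].score
                prev[j] = i

    # Trace back from the best-scoring chain end.
    j = max(range(n), key=best.__getitem__)
    total = best[j]
    chain: List[Match] = []
    while j != -1:
        chain.append(order[j])
        j = prev[j]
    chain.reverse()
    return total, chain
```

In a pipeline like the one described above, the matches of the selected chain would then serve as anchor constraints for a subsequent alignment step; how those constraints are passed to an aligner is outside the scope of this sketch.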
