StemSearch: RNA search tool based on stem identification and indexing

The discovery and functional analysis of noncoding RNA (ncRNA) systems in different organisms motivate the development of tools that aid ncRNA research. Several tools exist that search for occurrences of a given RNA structural profile in genomic sequences. Yet there is a need for an "RNA BLAST" tool, i.e., a tool that takes a putative functional RNA sequence as input and efficiently searches for similar sequences in genomic databases, taking into consideration potential secondary structure features of the input query sequence. This work aims to provide such a tool. Our tool, denoted StemSearch, is based on a structural representation of an RNA sequence by its potential stems. Potential stems in genomic sequences are identified in a preprocessing stage and indexed. A user-provided query sequence is processed in the same way, and stems from the target genomes that are similar to the query stems are retrieved from the index. Then, relevant genomic regions are identified and ranked according to their similarity to the query stem set while enforcing conservation of cross-stem topology. Experiments using RFAM families show significantly improved recall for StemSearch over BLAST, with a small loss of precision. We further demonstrate our system's capability to handle eukaryotic genomes by successfully searching for members of the 7SK family in chromosome 2 of the human genome.
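The first stages of the pipeline described above (stem identification, indexing, and retrieval) can be illustrated with a minimal Python sketch. It is not the StemSearch implementation: the function names, the pairing rules (Watson-Crick plus G-U wobble), the thresholds `min_len`, `min_loop`, and `max_span`, and the use of exact arm-sequence matching as a stand-in for stem similarity are all assumptions made for illustration. Region ranking and the cross-stem topology check are not covered.

```python
from collections import defaultdict

# Assumed pairing rules: Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def find_stems(seq, min_len=4, min_loop=3, max_span=40):
    """Enumerate maximal runs of stacked base pairs.

    A stem (i, j, k) pairs positions i+t and j-t for t = 0..k-1.
    All thresholds are hypothetical defaults, not values from the paper.
    """
    stems = []
    n = len(seq)
    for i in range(n):
        for j in range(i + min_loop + 1, min(n, i + max_span)):
            # Skip stems that extend outward within the allowed span:
            # they are reported when the scan starts from the outer pair.
            if (i > 0 and j + 1 < n and (j + 1) - (i - 1) < max_span
                    and seq[i - 1] + seq[j + 1] in PAIRS):
                continue
            k = 0
            # Extend inward while bases pair and the loop keeps min_loop bases.
            while (j - k) - (i + k) > min_loop and seq[i + k] + seq[j - k] in PAIRS:
                k += 1
            if k >= min_len:
                stems.append((i, j, k))
    return stems


def stem_key(seq, stem):
    """Index key: the 5' and 3' arm sequences of the stem."""
    i, j, k = stem
    return seq[i:i + k], seq[j - k + 1:j + 1]


def build_index(genomes):
    """Preprocessing stage: map each stem key to its genomic occurrences."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for stem in find_stems(seq):
            index[stem_key(seq, stem)].append((name, stem[0], stem[1]))
    return index


def query_hits(index, query):
    """Retrieve genomic stems whose arms exactly match a query stem."""
    hits = []
    for stem in find_stems(query):
        for name, i, j in index.get(stem_key(query, stem), []):
            hits.append((stem, name, i, j))
    return hits


index = build_index({"chr": "CACAGGGGAAAACCCCACAC"})
hits = query_hits(index, "GGGGAAAACCCC")
```

Exact matching keeps the lookup a plain dictionary access. A similarity-tolerant index would need a different key design, which this sketch does not attempt.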
