Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%

MOTIVATION Searching for non-coding RNA (ncRNA) genes and structural RNA elements (eleRNA) are major challenges in gene finding today as these often are conserved in structure rather than in sequence. Even though the number of available methods is growing, it is still of interest to pairwise detect two genes with low sequence similarity, where the genes are part of a larger genomic region. RESULTS Here we present such an approach for pairwise local alignment which is based on foldalign and the Sankoff algorithm for simultaneous structural alignment of multiple sequences. We include the ability to conduct mutual scans of two sequences of arbitrary length while searching for common local structural motifs of some maximum length. This drastically reduces the complexity of the algorithm. The scoring scheme includes structural parameters corresponding to those available for free energy as well as for substitution matrices similar to RIBOSUM. The new foldalign implementation is tested on a dataset where the ncRNAs and eleRNAs have sequence similarity <40% and where the ncRNAs and eleRNAs are energetically indistinguishable from the surrounding genomic sequence context. The method is tested in two ways: (1) its ability to find the common structure between the genes only and (2) its ability to locate ncRNAs and eleRNAs in a genomic context. In case (1), it makes sense to compare with methods like Dynalign, and the performances are very similar, but foldalign is substantially faster. The structure prediction performance for a family is typically around 0.7 using Matthews correlation coefficient. In case (2), the algorithm is successful at locating RNA families with an average sensitivity of 0.8 and a positive predictive value of 0.9 using a BLAST-like hit selection scheme. AVAILABILITY The program is available online at http://foldalign.kvl.dk/

[1]  Zasha Weinberg,et al.  Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy , 2004, ISMB/ECCB.

[2]  Hélène Touzet,et al.  Finding the common structure shared by two homologous RNAs , 2003, Bioinform..

[3]  Christian Zwieb,et al.  SRPDB: Signal Recognition Particle Database , 2003, Nucleic Acids Res..

[4]  Jan Gorodkin,et al.  Evolutionary rate variation and RNA secondary structure prediction , 2004, Comput. Biol. Chem..

[5]  Maciej Szymanski,et al.  5S Ribosomal RNA Database , 2002, Nucleic Acids Res..

[6]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Christian Zwieb,et al.  Semi-automated update and cleanup of structural RNA alignment databases , 2001, Bioinform..

[8]  Xing Xu,et al.  A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences , 2004, Bioinform..

[9]  Christian Zwieb The uRNA database , 1997, Nucleic Acids Res..

[10]  J. Miranda-Ríos,et al.  A conserved RNA structure (thi box) is involved in regulation of thiamin biosynthetic gene expression in bacteria , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[11]  M. Saraste,et al.  FEBS Lett , 2000 .

[12]  Yves Van de Peer,et al.  Database on the structure of small ribosomal subunit RNA , 1996, Nucleic Acids Res..

[13]  A. Hüttenhofer,et al.  RNomics: identification and function of small, non-messenger RNAs. , 2002, Current opinion in chemical biology.

[14]  J. Jonsson,et al.  Two-dimensional conformation-dependent electrophoresis (2D-CDE) to separate DNA fragments containing unmatched bulge from complex DNA samples. , 2004, Nucleic acids research.

[15]  Ian Holmes,et al.  A probabilistic model for the evolution of RNA structure , 2004, BMC Bioinformatics.

[16]  S. Altschul,et al.  The estimation of statistical parameters for local alignment score distributions. , 2001, Nucleic acids research.

[17]  Peter F. Stadler,et al.  Alignment of RNA base pairing probability matrices , 2004, Bioinform..

[18]  Zasha Weinberg,et al.  Faster genome annotation of non-coding RNA families without loss of accuracy , 2004, RECOMB.

[19]  Michael Zuker,et al.  Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information , 1981, Nucleic Acids Res..

[20]  Michael P. S. Brown,et al.  Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars , 2000, ISMB.

[21]  S. Eddy,et al.  A computational screen for methylation guide snoRNAs in yeast. , 1999, Science.

[22]  Laurie J. Heyer,et al.  A Generalized Erdös-Rényi Law for Sequence Analysis Problems , 2000 .

[23]  Sean R. Eddy,et al.  RSEARCH: Finding homologs of single structured RNA sequences , 2003, BMC Bioinformatics.

[24]  G. Stormo,et al.  Discovering common stem-loop motifs in unaligned RNA sequences. , 2001, Nucleic acids research.

[25]  Peter F. Stadler,et al.  Prediction of locally stable RNA secondary structures for genome-wide surveys , 2004, Bioinform..

[26]  J. Hein,et al.  Combining many multiple alignments in one improved alignment , 1999, Bioinform..

[27]  Laurie J. Heyer,et al.  Finding the most significant common sequence and structure motifs in a set of RNA sequences. , 1997, Nucleic acids research.

[28]  Sean R. Eddy,et al.  Rfam: an RNA family database , 2003, Nucleic Acids Res..

[29]  L. Lim,et al.  An Abundant Class of Tiny RNAs with Probable Regulatory Roles in Caenorhabditis elegans , 2001, Science.

[30]  Jeffrey E. Barrick,et al.  Riboswitches Control Fundamental Biochemical Pathways in Bacillus subtilis and Other Bacteria , 2003, Cell.

[31]  J Gorodkin,et al.  A mini-greedy algorithm for faster structural RNA stem-loop search. , 2001, Genome informatics. International Conference on Genome Informatics.

[32]  Gary D. Stormo,et al.  Finding Common Sequence and Structure Motifs in a Set of RNA Sequences , 1997, ISMB.

[33]  S. Eddy Computational Genomics of Noncoding RNA Genes , 2002, Cell.

[34]  G. Mauri,et al.  An algorithm for finding conserved secondary structure motifs in unaligned RNA sequences , 2008, Journal of Computer Science and Technology.

[35]  D. Ecker,et al.  RNAMotif, an RNA secondary structure definition and search algorithm. , 2001, Nucleic acids research.

[36]  Yves Van de Peer,et al.  Database on the structure of small ribosomal subunit RNA , 1998, Nucleic Acids Res..

[37]  D. Turner,et al.  Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. , 2002, Journal of molecular biology.

[38]  Sean R. Eddy,et al.  A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure , 2002, BMC Bioinformatics.

[39]  A. Krogh,et al.  No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. , 1999, Nucleic acids research.

[40]  Michael Zuker,et al.  Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide , 1999 .

[41]  G. Stormo,et al.  A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. , 2004, Bioinformatics.

[42]  R. Breaker,et al.  An mRNA structure that controls gene expression by binding FMN , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Elena Rivas,et al.  Noncoding RNA gene detection using comparative sequence analysis , 2001, BMC Bioinformatics.

[44]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[45]  Sergey Steinberg,et al.  Compilation of tRNA sequences and sequences of tRNA genes , 2004, Nucleic Acids Res..

[46]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[47]  P. Stadler,et al.  Secondary structure prediction for aligned RNA sequences. , 2002, Journal of molecular biology.

[48]  D. Turner,et al.  Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. , 1998, Biochemistry.

[49]  S. Eddy,et al.  tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. , 1997, Nucleic acids research.

[50]  J. Sabina,et al.  Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. , 1999, Journal of molecular biology.

[51]  David L. Wheeler,et al.  GenBank: update , 2004, Nucleic Acids Res..

[52]  Rolf Olsen,et al.  Rapid Assessment of Extremal Statistics for Gapped Local Alignment , 1999, ISMB.

[53]  V. Ambros,et al.  An Extensive Class of Small RNAs in Caenorhabditis elegans , 2001, Science.

[54]  Nobuo Yamashita,et al.  Thiamine‐regulated gene expression of Aspergillus oryzae thiA requires splicing of the intron containing a riboswitch‐like domain in the 5′‐UTR , 2003, FEBS letters.

[55]  R. Breaker,et al.  Adenine riboswitches and gene activation by disruption of a transcription terminator , 2004, Nature Structural &Molecular Biology.

[56]  T. Tuschl,et al.  Identification of Novel Genes Coding for Small Expressed RNAs , 2001, Science.

[57]  D. Sankoff Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems , 1985 .

[58]  M. Gelfand,et al.  Comparative Genomics of Thiamin Biosynthesis in Procaryotes , 2002, The Journal of Biological Chemistry.

[59]  I. Hofacker,et al.  Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. , 2004, Journal of molecular biology.

[60]  Margaret S. Ebert,et al.  An mRNA structure in bacteria that controls gene expression by binding lysine. , 2003, Genes & development.