Discovering common stem-loop motifs in unaligned RNA sequences.

Post-transcriptional regulation of gene expression is often accomplished by proteins binding to specific sequence motifs in mRNA molecules, to affect their translation or stability. The motifs are often composed of a combination of sequence and structural constraints such that the overall structure is preserved even though much of the primary sequence is variable. While several methods exist to discover transcriptional regulatory sites in the DNA sequences of coregulated genes, the RNA motif discovery problem is much more difficult because of covariation in the positions. We describe the combined use of two approaches for RNA structure prediction, FOLDALIGN and COVE, that together can discover and model stem-loop RNA motifs in unaligned sequences, such as UTRs from post-transcriptionally coregulated genes. We evaluate the method on two datasets, one a section of rRNA genes with randomly truncated ends so that a global alignment is not possible, and the other a hyper-variable collection of IRE-like elements that were inserted into randomized UTR sequences. In both cases the combined method identified the motifs correctly, and in the rRNA example we show that it is capable of determining the structure, which includes bulge and internal loops as well as a variable length hairpin loop. Those automated results are quantitatively evaluated and found to agree closely with structures contained in curated databases, with correlation coefficients up to 0.9. A basic server, Stem-Loop Align SearcH (SLASH), which will perform stem-loop searches in unaligned RNA sequences, is available at http://www.bioinf.au.dk/slash/.

[1]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[2]  C. Woese,et al.  5S RNA secondary structure , 1975, Nature.

[3]  R. Nussinov,et al.  Fast algorithm for predicting the secondary structure of single-stranded RNA. , 1980, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Michael Zuker,et al.  Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information , 1981, Nucleic Acids Res..

[5]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[6]  S. Altschul,et al.  Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. , 1985, Molecular biology and evolution.

[7]  D. Sankoff Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems , 1985 .

[8]  G. Stormo,et al.  Identifying protein-binding sites from unaligned DNA fragments. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[9]  N. Pace,et al.  Phylogenetic comparative analysis and the secondary structure of ribonuclease P RNA--a review. , 1989, Gene.

[10]  Gary D. Stormo,et al.  Identification of consensus patterns in unaligned DNA sequences known to be functionally related , 1990, Comput. Appl. Biosci..

[11]  S. Knudsen,et al.  Prediction of human mRNA donor and acceptor sites from the DNA sequence. , 1991, Journal of molecular biology.

[12]  G. Stormo,et al.  Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. , 1992, Nucleic acids research.

[13]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[14]  M. Price,et al.  Accessing integrated genomic data using GenoBase: A tutorial, Part 1 , 1993 .

[15]  R. C. Underwood,et al.  Stochastic context-free grammars for tRNA modeling. , 1994, Nucleic acids research.

[16]  U. Hobohm,et al.  Enlarged representative set of protein structures , 1994, Protein science : a publication of the Protein Society.

[17]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[18]  R. Durbin,et al.  RNA sequence analysis using covariance models. , 1994, Nucleic acids research.

[19]  M. Hentze,et al.  Finding the hairpin in the haystack: searching for RNA motifs. , 1995, Trends in genetics : TIG.

[20]  D Gautheret,et al.  Identification of base-triples in RNA using comparative sequence analysis. , 1995, Journal of molecular biology.

[21]  François Major,et al.  Symbolic Generation and Clustering of RNA 3-D Motifs , 1995, ISMB.

[22]  M. Hentze,et al.  Molecular control of vertebrate iron metabolism: mRNA-based regulatory circuits operated by iron, nitric oxide, and oxidative stress. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[23]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[24]  Gary D. Stormo,et al.  Displaying the information contents of structural RNA alignments: the structure logos , 1997, Comput. Appl. Biosci..

[25]  Laurie J. Heyer,et al.  Finding the most significant common sequence and structure motifs in a set of RNA sequences. , 1997, Nucleic acids research.

[26]  Gary D. Stormo,et al.  Finding Common Sequence and Structure Motifs in a Set of RNA Sequences , 1997, ISMB.

[27]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[28]  Gary D. Stormo,et al.  An RNA folding method capable of identifying pseudoknots and base triples , 1998, Bioinform..

[29]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[30]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[31]  E. Tillier,et al.  High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. , 1998, Genetics.

[32]  David R. Gilbert,et al.  Approaches to the Automatic Discovery of Patterns in Biosequences , 1998, J. Comput. Biol..

[33]  Michael Schöniger,et al.  Toward Assigning Helical Regions in Alignments of Ribosomal RNA and Testing the Appropriateness of Evolutionary Models , 1999, Journal of Molecular Evolution.

[34]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[35]  J. Sabina,et al.  Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. , 1999, Journal of molecular biology.

[36]  Ole Lund,et al.  MatrixPlot: visualizing sequence constraints , 1999, Bioinform..

[37]  Michael Zuker,et al.  Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide , 1999 .

[38]  S. Gygi,et al.  Correlation between Protein and mRNA Abundance in Yeast , 1999, Molecular and Cellular Biology.

[39]  S. Eddy,et al.  A computational screen for methylation guide snoRNAs in yeast. , 1999, Science.

[40]  J. Hein,et al.  Combining many multiple alignments in one improved alignment , 1999, Bioinform..

[41]  J. Hein,et al.  Statistical alignment: computational properties, homology testing and goodness-of-fit. , 2000, Journal of molecular biology.

[42]  Laurie J. Heyer,et al.  A Generalized Erdös-Rényi Law for Sequence Analysis Problems , 2000 .

[43]  James R. Cole,et al.  The RDP (Ribosomal Database Project) continues , 2000, Nucleic Acids Res..

[44]  J. Parsch,et al.  Comparative sequence analysis and patterns of covariation in RNA secondary structures. , 2000, Genetics.

[45]  Yves Van de Peer,et al.  The European Small Subunit Ribosomal RNA database , 2000, Nucleic Acids Res..

[46]  C. Pabo,et al.  Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? , 2000, Journal of molecular biology.

[47]  S. Le,et al.  Prediction of common secondary structures of RNAs: a genetic algorithm approach. , 2000, Nucleic acids research.

[48]  Graziano Pesole,et al.  UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs , 2000, Nucleic Acids Res..

[49]  Michael P. S. Brown,et al.  Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars , 2000, ISMB.

[50]  Gary D. Stormo,et al.  Phylogenetically enhanced statistical tools for RNA structure prediction , 2000, Bioinform..

[51]  Jotun Hein,et al.  An Algorithm for Statistical Alignment of Sequences Related by a Binary Tree , 2000, Pacific Symposium on Biocomputing.