A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences

Motivation: RNA structure motifs contained in mRNAs have been found to play important roles in regulating gene expression. However, computational identification of novel RNA regulatory motifs has not been widely explored, and effective tools for predicting such motifs from genomic sequences are needed.

Results: We present a new method for predicting common RNA secondary structure motifs in a set of functionally or evolutionarily related RNA sequences. The method compares stems (palindromic helices) between sequences using graph-theoretical approaches. It first finds all possible stable stems in each sequence and compares stems pairwise between sequences, using a set of defined features, to find stems conserved across any two sequences. Then, by applying a maximum clique finding algorithm, it finds all significant stems conserved across at least k sequences. Finally, it assembles in topological order all possible compatible conserved stems shared by at least k sequences and reports the best assembled stem sets as candidate common structure motifs. The method does not require prior structural alignment of the sequences and is able to detect pseudoknot structures. We have tested this approach on RNA sequences with known secondary structures, on which it detects the real structures completely or partially correctly and outperforms other existing programs for similar purposes.

Availability: The algorithm has been implemented in C++ in a program called comRNA, which is available at http://ural.wustl.edu/softwares.html

Contact: stormo@genetics.wustl.edu
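
The stem-finding, pairwise comparison and clique steps described in the Results can be illustrated with a minimal Python sketch. It is not the comRNA implementation, and several simplifications are assumptions made only for illustration: stems are accepted by a minimum length rather than by a free-energy stability criterion, two stems count as similar when their lengths and spans are close, and cliques are enumerated with the generic Bron-Kerbosch algorithm. All function names (find_stems, similar, maximal_cliques, conserved_stems) and thresholds are hypothetical. The final assembly of compatible stems is omitted.

```python
from itertools import combinations

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_stems(seq, min_len=3, min_loop=3):
    """Return maximal stems (i, j, length): seq[i+t] pairs with seq[j-t] for t < length."""
    n, stems = len(seq), []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that merely continue a stem starting further out.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Extend inward while bases pair and more than min_loop positions separate them.
            while ((j - length) - (i + length) > min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems

def similar(a, b, len_tol=1, span_tol=2):
    """Assumed similarity test: stem lengths and spans (j - i) must be close."""
    return abs(a[2] - b[2]) <= len_tol and abs((a[1] - a[0]) - (b[1] - b[0])) <= span_tol

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(frozenset(), frozenset(adj), frozenset())

def conserved_stems(seqs, k):
    """Return groups of mutually similar stems found in at least k different sequences."""
    nodes = [(s, stem) for s, seq in enumerate(seqs) for stem in find_stems(seq)]
    adj = {v: set() for v in nodes}
    for u, v in combinations(nodes, 2):
        # Only stems from different sequences are linked, so a clique
        # holds at most one stem per sequence.
        if u[0] != v[0] and similar(u[1], v[1]):
            adj[u].add(v)
            adj[v].add(u)
    return [sorted(c) for c in maximal_cliques(adj) if len(c) >= k]
```

Because stems from the same sequence are never linked, the size of a clique equals the number of sequences that share the stem, so filtering on clique size gives the "at least k sequences" condition directly. For example, conserved_stems(["GGGAAAACCC", "AAGGGAAAACCCAA", "CGGGAAAACCCG", "AAAAAAAAAA"], k=3) groups the one hairpin stem found in each of the first three toy sequences.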
