Improved covariance model parameter estimation using RNA thermodynamic properties

Covariance models are powerful descriptions of non-coding RNA (ncRNA) families and can be used to search nucleotide databases for new members of those families. Currently, the parameters of a covariance model (state transition and emission scores) are estimated only from the observed frequencies of mutations, insertions, and deletions in known ncRNA sequences. For families with very few known members, this can produce rather uninformative models in which the consensus sequence scores well and most deviations from consensus receive a fairly uniform, poor score. It is proposed here to combine the traditional observed-frequency information with known information about free energy changes in RNA helix formation and about loop length changes. Deviations from the consensus sequence that are more thermodynamically probable will then be favored in database searches. The thermodynamic information may be incorporated into the models as informative priors that depend on neighboring consensus nucleotides and on loop lengths.
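
As a rough illustration of the idea above, the following sketch turns free energies into Dirichlet-style pseudocounts through Boltzmann weights and combines them with observed base-pair counts. It is an assumption-laden toy, not the estimation procedure of the paper: the function names, the `prior_weight` parameter, the choice of a posterior-mean estimate, and the free energy values in the usage example are all hypothetical placeholders, and real values would come from measured nearest-neighbor parameters that depend on the neighboring consensus nucleotides.

```python
import math

# Gas constant in kcal/(mol*K) times 37 degrees C expressed in kelvin.
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def thermodynamic_prior(delta_g, prior_weight):
    """Convert free energies (kcal/mol) into pseudocounts.

    Each outcome gets a Boltzmann weight exp(-dG / RT); the weights are
    normalized and scaled so the pseudocounts sum to prior_weight
    (a hypothetical knob controlling how strongly the prior counts).
    """
    weights = {k: math.exp(-g / RT_KCAL_PER_MOL) for k, g in delta_g.items()}
    total = sum(weights.values())
    return {k: prior_weight * w / total for k, w in weights.items()}


def posterior_mean(counts, pseudocounts):
    """Posterior-mean probabilities: (count + pseudocount) / total."""
    total = sum(counts.get(k, 0) + a for k, a in pseudocounts.items())
    return {k: (counts.get(k, 0) + a) / total for k, a in pseudocounts.items()}


if __name__ == "__main__":
    # Placeholder free energies for illustration only, NOT measured values.
    illustrative_delta_g = {"GC": -3.0, "AU": -2.0, "GU": -1.0, "AA": 1.0}
    # A family with only three known members, all with a G-C pair here.
    observed_counts = {"GC": 3}

    prior = thermodynamic_prior(illustrative_delta_g, prior_weight=2.0)
    for pair, p in posterior_mean(observed_counts, prior).items():
        print(f"{pair}: {p:.4f}")
```

With so few observations, a uniform prior would score every non-consensus pair identically. Here the unobserved pairs are instead ranked by their assumed stability, which is the behavior the abstract argues for.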
