Prediction of RNA secondary structure by maximizing pseudo-expected accuracy

Background: Recent studies have revealed the importance of considering the entire distribution of possible secondary structures in RNA secondary structure prediction; accordingly, new types of estimators, including the maximum expected accuracy (MEA) estimator, have been proposed. The MEA-based estimators are designed to maximize the expected accuracy of the base-pairs and have achieved the highest level of accuracy. Those methods, however, do not give a single best prediction of the structure, but employ a parameter to control the trade-off between sensitivity and positive predictive value (PPV). It is unclear what parameter value should be used, and even a well-trained default value does not, in general, give the best result under popular accuracy measures for each individual RNA sequence.

Results: Instead of the expected values of the popular accuracy measures for RNA secondary structure prediction, which are difficult to calculate, the pseudo-expected accuracy, which can easily be computed from base-pairing probabilities, is introduced. It is shown that the pseudo-expected accuracy is a good approximation in terms of sensitivity, PPV, MCC, or F-score. The pseudo-expected accuracy can be approximately maximized for each RNA sequence by stochastic sampling. It is also shown that secondary structures that balance sensitivity and PPV well can be predicted with a small computational overhead by combining the pseudo-expected accuracy of MCC or F-score with the γ-centroid estimator.

Conclusions: This study gives not only a method for predicting a secondary structure that balances sensitivity and PPV, but also a general method for approximately maximizing the (pseudo-)expected accuracy with respect to various evaluation measures, including MCC and F-score.
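The abstract does not spell out how the pseudo-expected accuracy is computed, so the following is only a minimal sketch under a stated assumption: the expected numbers of true/false positive and negative base-pairs are obtained from the base-pairing probabilities, and those expected counts are then plugged into the usual F-score or MCC formula. The function names (`expected_counts`, `pseudo_f_score`, `pseudo_mcc`, `best_candidate`), the data layout, and the toy probabilities are hypothetical and chosen for illustration. The candidate structures could come, for example, from stochastic sampling or from the γ-centroid estimator run at several values of γ.

```python
import math

def expected_counts(pairs, bpp, n):
    """Expected TP, TN, FP, FN for a predicted structure.

    pairs: set of predicted base-pairs (i, j) with i < j
    bpp:   dict mapping (i, j) -> base-pairing probability (assumed given)
    n:     sequence length
    """
    total_prob = sum(bpp.values())                # expected number of true pairs
    tp = sum(bpp.get(p, 0.0) for p in pairs)      # predicted and expected true
    fp = len(pairs) - tp                          # predicted but expected false
    fn = total_prob - tp                          # expected true but not predicted
    tn = n * (n - 1) // 2 - tp - fp - fn          # all remaining position pairs
    return tp, tn, fp, fn

def pseudo_f_score(tp, tn, fp, fn):
    """F-score evaluated on expected counts (tn is unused but kept for a uniform signature)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def pseudo_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient evaluated on expected counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def best_candidate(candidates, bpp, n, measure=pseudo_f_score):
    """Return the candidate structure with the highest pseudo-expected accuracy."""
    return max(candidates, key=lambda y: measure(*expected_counts(y, bpp, n)))

# Toy example: three nested candidate structures on a sequence of length 8.
bpp = {(0, 7): 0.9, (1, 6): 0.8, (2, 5): 0.3}
candidates = [
    frozenset({(0, 7)}),
    frozenset({(0, 7), (1, 6)}),
    frozenset({(0, 7), (1, 6), (2, 5)}),
]
chosen = best_candidate(candidates, bpp, n=8)
```

In this toy case the two-pair structure is selected: adding the low-probability pair (2, 5) raises the expected sensitivity but lowers the expected PPV enough to reduce the pseudo-expected F-score. Passing `measure=pseudo_mcc` switches the criterion without changing the selection loop.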
