An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding

Background: The basic RNA secondary structure prediction problem, or single sequence folding (SSF) problem, was solved 35 years ago by a now well-known $$O(n^3)$$-time dynamic programming method. Recently, three methodologies (Valiant, Four-Russians, and Sparsification) have been applied to speed up RNA secondary structure prediction. The Sparsification method exploits two properties of the input: the number Z of subsequences whose endpoints belong to the optimal folding set, and the maximum number of base pairs L. These sparsity properties satisfy $$0 \le L \le n / 2$$ and $$n \le Z \le n^2 / 2$$, and the method reduces the algorithmic running time to $$O(LZ)$$. The Four-Russians method, in contrast, relies on tabling partial results.

Results: In this paper, we explore three different algorithmic speedups. First, we expand and reformulate the single sequence folding Four-Russians $$\Theta \left(\frac{n^3}{\log ^2 n}\right)$$-time algorithm to utilize an on-demand lookup table. Second, we create a framework that combines the fastest Sparsification method and the new, fastest on-demand Four-Russians method. This combined method has a worst-case running time of $$O(\tilde{L}\tilde{Z})$$, where $$\frac{L}{\log n} \le \tilde{L} \le \min \left(L, \frac{n}{\log n}\right)$$ and $$\frac{Z}{\log n} \le \tilde{Z} \le \min \left(Z, \frac{n^2}{\log n}\right)$$. Third, we update the Four-Russians formulation to achieve an on-demand $$O(n^2 / \log ^2 n)$$-time parallel algorithm. This then leads to an asymptotic speedup of $$O(\tilde{L}\tilde{Z_j})$$, where $$\frac{Z_j}{\log n} \le \tilde{Z_j} \le \min \left(Z_j, \frac{n}{\log n}\right)$$ and $$Z_j$$ is the number of subsequences with the endpoint j belonging to the optimal folding set.

Conclusions: The on-demand formulation not only removes all extraneous computation and allows us to incorporate more realistic scoring schemes, but also lets us take advantage of the sparsity properties. Through asymptotic analysis and empirical testing on the base-pair maximization variant and on a more biologically informative scoring scheme, we show that this Sparse Four-Russians framework achieves a speedup on every problem instance that is asymptotically never worse, and empirically better, than that achieved by the minimum of the two methods alone.
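For orientation, the base-pair maximization variant mentioned above can be sketched with the classic cubic-time recurrence. This is only an illustrative baseline, not the Four-Russians or sparsified algorithm of the paper. The function name `max_base_pairs`, the `min_loop` parameter, and the choice to allow G-U wobble pairs are assumptions made for this example.

```python
# Minimal sketch of the well-known O(n^3) base-pair maximization DP.
# Assumptions (not from the paper): allowed pairs include G-U wobble pairs,
# and `min_loop` is a hypothetical parameter for the minimum number of
# unpaired bases enclosed by any pair.

ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def max_base_pairs(seq, min_loop=0):
    """Return the maximum number of non-crossing base pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] = best score for the subsequence seq[i..j], inclusive.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            best = S[i][j - 1]
            # Case 2: position j pairs with some k in [i, j - min_loop).
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = S[i][k - 1] if k > i else 0
                    inner = S[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + inner + 1)
            S[i][j] = best
    return S[0][n - 1]
```

The innermost loop over the split point k is the source of the cubic running time; it is this loop that sparsification prunes and that the Four-Russians approach accelerates through table lookups.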
