LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities

Abstract

Motivation: RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore prohibitively slow for long sequences. This slowness is even more severe than that of cubic-time free energy minimization, due to a substantially larger constant factor in runtime.

Results: Inspired by the success of our recent LinearFold algorithm, which predicts the approximate minimum free energy structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base-pairing probabilities. It is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g. 2.5 days versus 1.3 min on a sequence of length 32 753 nt). More interestingly, the resulting base-pairing probabilities are even better correlated with the ground-truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on the families with the longest sequences (16S and 23S rRNAs), as well as a substantial improvement on long-distance base pairs (500+ nt apart).

Availability and implementation: Code: http://github.com/LinearFold/LinearPartition; Server: http://linearfold.org/partition.

Supplementary information: Supplementary data are available at Bioinformatics online.
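To make the cubic scaling mentioned in the abstract concrete, the following is a minimal sketch of a classical-style partition function recursion. It is not the LinearPartition algorithm and it does not use the thermodynamic or CONTRAfold energy models: it assumes a toy model in which every allowed base pair contributes the same Boltzmann weight. The function name `partition_function`, the parameters `pair_weight` and `min_loop`, and the pairing table `PAIRS` are all hypothetical names introduced for illustration.

```python
# Toy partition function over nested secondary structures.
# Assumption: each base pair contributes a fixed weight `pair_weight`
# (a stand-in for exp(-dG/RT)); real energy models are far richer.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_weight=1.0, min_loop=3):
    """Return the sum of weights over all nested structures of `seq`.

    Q[i][j] covers the half-open span seq[i:j]; an empty span has weight 1.
    `min_loop` is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    # Every entry starts at 1.0, which is already correct for empty spans.
    Q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):           # O(n) span lengths
        for i in range(0, n - length + 1):   # O(n) start positions
            j = i + length
            last = j - 1                      # index of the last base in the span
            total = Q[i][j - 1]               # case 1: the last base is unpaired
            # case 2: the last base pairs with some k; O(n) split points
            for k in range(i, last - min_loop):
                if (seq[k], seq[last]) in PAIRS:
                    total += Q[i][k] * Q[k + 1][last] * pair_weight
            Q[i][j] = total
    return Q[0][n]


if __name__ == "__main__":
    # With pair_weight=1.0 the result is simply the number of structures.
    print(partition_function("GAAAC"))
```

The three nested loops are the source of the cubic runtime that the abstract contrasts with the linear-time heuristic. With `pair_weight=1.0` the function counts structures, so the sequence `GAAAC` gives 2: the open chain and the single G-C hairpin.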
