Remote homology search with hidden Potts models

Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM), which merges a Potts emission process with a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof-of-principle experiments.

Author summary

Computational homology search and alignment tools are used to infer the functions and evolutionary histories of biological sequences. Most widely used tools for sequence homology searches, such as BLAST and HMMER, rely on primary sequence conservation alone. It should be possible to make more powerful search tools by also considering higher-order covariation patterns induced by 3D structure conservation. Recent advances in 3D protein structure prediction have used a class of statistical physics models called Potts models to infer pairwise correlation structure in multiple sequence alignments. However, Potts models assume alignments are given and cannot build new alignments, limiting their use in homology search. We have extended Potts models to include a probability model of insertion and deletion, yielding a new model we call a hidden Potts model (HPM) that can be applied to sequence alignment and remote homology search. Tests of our prototype HPM software show promising results in initial benchmarking experiments, though more work will be needed to use HPMs in practical tools.
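The importance sampling idea mentioned in the abstract can be illustrated with a toy example. The sketch below is not the HPM algorithm: it only shows the general technique of estimating an intractable sum under a model with pairwise couplings by drawing samples from a simpler, independent-site proposal and averaging the weight ratios. In an HPM the sum would run over alignments rather than sequences. Every name and number here (the fields `H`, the coupled pair `PAIR`, the strength `COUPLING`, the function names) is a hypothetical choice made for illustration.

```python
import itertools
import math
import random

ALPHABET = "ACGU"
L = 4  # number of sites in the toy model (hypothetical)

# Hypothetical single-site fields h_i(a), one dict per site.
H = [
    {"A": 0.5, "C": 0.0, "G": 0.2, "U": -0.3},
    {"A": 0.0, "C": 0.4, "G": -0.1, "U": 0.1},
    {"A": -0.2, "C": 0.1, "G": 0.3, "U": 0.0},
    {"A": 0.1, "C": 0.2, "G": 0.0, "U": 0.4},
]

# One hypothetical coupled pair of sites that rewards Watson-Crick pairing.
PAIR = (0, 3)
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
COUPLING = 1.0


def log_weight(seq):
    """Unnormalized log-probability of seq under the toy pairwise model."""
    score = sum(H[i][a] for i, a in enumerate(seq))
    if (seq[PAIR[0]], seq[PAIR[1]]) in WATSON_CRICK:
        score += COUPLING
    return score


def proposal_probs():
    """Independent-site proposal: each site is drawn with probability
    proportional to exp(h_i(a)), ignoring the coupling."""
    probs = []
    for fields in H:
        z = sum(math.exp(v) for v in fields.values())
        probs.append({a: math.exp(v) / z for a, v in fields.items()})
    return probs


def exact_partition():
    """Brute-force sum over all 4**L sequences (feasible only for tiny L)."""
    return sum(
        math.exp(log_weight(s)) for s in itertools.product(ALPHABET, repeat=L)
    )


def importance_estimate(n_samples, rng):
    """Estimate the same sum as the average of weight(seq) / q(seq)
    over sequences sampled from the proposal q."""
    q = proposal_probs()
    total = 0.0
    for _ in range(n_samples):
        seq = [
            rng.choices(ALPHABET, weights=[q[i][a] for a in ALPHABET])[0]
            for i in range(L)
        ]
        log_q = sum(math.log(q[i][a]) for i, a in enumerate(seq))
        total += math.exp(log_weight(seq) - log_q)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact    :", exact_partition())
    print("estimate :", importance_estimate(20000, rng))
```

The estimate is only as good as the proposal: when the proposal covers the high-weight configurations well, the weight ratios vary little and few samples are needed. This is presumably why a simpler probabilistic model of the same family would be a natural proposal distribution.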

[1]  Glazier,et al.  Simulation of biological cell sorting using a two-dimensional extended Potts model. , 1992, Physical review letters.

[2]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[3]  Sean R. Eddy,et al.  A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation , 2008, PLoS Comput. Biol..

[4]  R. Montange,et al.  Structure of the S-adenosylmethionine riboswitch regulatory mRNA element , 2006, Nature.

[5]  S. Henikoff,et al.  Protein family classification based on searching a database of blocks. , 1994, Genomics.

[6]  E. Rivas,et al.  RNA structure prediction using positive and negative evolutionary information , 2020, bioRxiv.

[7]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[8]  Adam J. Riesselman,et al.  3D RNA and Functional Interactions from Evolutionary Couplings , 2015, Cell.

[9]  Simona Cocco,et al.  ACE: adaptive cluster expansion for maximum entropy graphical model inference , 2016, bioRxiv.

[10]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[11]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.

[12]  A. Murray,et al.  Many, but not all, lineage-specific genes can be explained by homology detection failure , 2020, bioRxiv.

[13]  Rama Ranganathan,et al.  Coevolution-based inference of amino acid interactions underlying protein function , 2018, eLife.

[14]  T. Smith,et al.  Modeling protein cores with Markov random fields. , 1994, Mathematical biosciences.

[15]  R. Breaker,et al.  A widespread self-cleaving ribozyme class is revealed by bioinformatics , 2013, Nature chemical biology.

[16]  Lenore Cowen,et al.  Markov random fields reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system , 2010, Proceedings of the National Academy of Sciences.

[17]  Jinbo Xu,et al.  A multiple‐template approach to protein threading , 2011, Proteins.

[18]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[19]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[20]  Sam Griffiths-Jones,et al.  RALEE--RNA ALignment Editor in Emacs , 2005, Bioinform..

[21]  D. Haussler,et al.  Protein modeling using hidden Markov models: analysis of globins , 1993, [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences.

[22]  David K. Y. Chiu,et al.  Inferring consensus structure from nucleic acid sequences , 1991, Comput. Appl. Biosci..

[23]  D. Crothers,et al.  Is there a discriminator site in transfer RNA? , 1972, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Lucy J. Colwell,et al.  Inferring interaction partners from protein sequences , 2016, Proceedings of the National Academy of Sciences.

[25]  A. Kinjo A unified statistical model of protein multiple sequence alignment integrating direct coupling and insertions , 2015, Biophysics and physicobiology.

[26]  D. Baker,et al.  Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information , 2014, eLife.

[27]  Sean R. Eddy,et al.  Infernal 1.1: 100-fold faster RNA homology searches , 2013, Bioinform..

[28]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[29]  S. Eddy,et al.  A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs , 2016, Nature Methods.

[30]  M. Weigt,et al.  Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1 , 2015, bioRxiv.

[31]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[32]  Sergey Steinberg,et al.  Compilation of tRNA sequences and sequences of tRNA genes , 2004, Nucleic Acids Res..

[33]  Simona Cocco,et al.  Inverse statistical physics of protein sequences: a key issues review , 2017, Reports on progress in physics. Physical Society.

[34]  Carlo Baldassi,et al.  Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis , 2016, Proceedings of the National Academy of Sciences.

[35]  Chris Bailey-Kellogg,et al.  Graphical Models of Residue Coupling in Protein Families , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[36]  Jaime G. Carbonell,et al.  Conditional Graphical Models for Protein Structural Motif Recognition , 2009, J. Comput. Biol..

[37]  Thomas A. Hopf,et al.  Mutation effects predicted from sequence co-variation , 2017, Nature Biotechnology.

[38]  Gary D. Stormo,et al.  A Maximum Entropy Formalism for Disentangling Chains of Correlated Sequence Positions , 1998, ISMB 1998.

[39]  R. Durbin,et al.  RNA sequence analysis using covariance models. , 1994, Nucleic acids research.

[40]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[41]  Sean R. Eddy,et al.  Query-Dependent Banding (QDB) for Faster RNA Similarity Searches , 2007, PLoS Comput. Biol..

[42]  G. Stormo,et al.  Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. , 1992, Nucleic acids research.

[43]  D. Baker,et al.  Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era , 2013, Proceedings of the National Academy of Sciences.

[44]  M. Sundaralingam,et al.  Restrained refinement of the monoclinic form of yeast phenylalanine transfer RNA. Temperature factors and dynamics, coordinated waters, and base-pair propeller twist angles. , 1986, Biochemistry.

[45]  Sean R. Eddy,et al.  Accelerated Profile HMM Searches , 2011, PLoS Comput. Biol..

[46]  J. Besag Efficiency of pseudolikelihood estimation for simple Gaussian fields , 1977 .

[47]  Andrea Pagnani,et al.  Aligning biological sequences by exploiting residue conservation and coevolution , 2020, bioRxiv.

[48]  Lenore Cowen,et al.  SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone , 2012, Bioinform..

[49]  D. Baker,et al.  Protein interaction networks revealed by proteome coevolution , 2019, Science.

[50]  Temple F. Smith,et al.  Global optimum protein threading with gapped alignment and empirical pair score functions. , 1996, Journal of molecular biology.

[51]  Michael J. Berry,et al.  Weak pairwise correlations imply strongly correlated network states in a neural population , 2005, Nature.

[52]  R. Levy,et al.  Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation. , 2018, Physical review. E.

[53]  Robert D. Finn,et al.  Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families , 2017, Nucleic Acids Res..

[54]  R. Levy,et al.  Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. , 2017, Current opinion in structural biology.

[55]  Simona Cocco,et al.  Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction , 2015, Nucleic acids research.