ComPotts: Optimal alignment of coevolutionary models for protein sequences

To assign structural and functional annotations to the ever increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models (pHMMs), which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition. Due to the presence of non-local dependencies, aligning two Potts models is computationally hard. To tackle this task, we introduce an Integer Linear Programming formulation of the problem and present ComPotts, an implementation able to compute the optimal alignment of two Potts models representing proteins in tractable time. A first experimentation on 59 low sequence identity pairwise alignments, extracted from 3 reference alignments from sisyphus and BaliBase3 databases, shows that ComPotts finds better alignments than the other tested methods in the majority of these cases.

[1]  Lenore Cowen,et al.  SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone , 2012, Bioinform..

[2]  Rumen Andonov,et al.  DALIX: Optimal DALI Protein Structure Alignment , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3]  Markus Gruber,et al.  CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations , 2014, Bioinform..

[4]  Rumen Andonov,et al.  Algorithm engineering for optimal alignment of protein structure distance matrices , 2011, Optim. Lett..

[5]  Zhiyong Wang,et al.  MRFalign: Protein Homology Detection through Alignment of Markov Random Fields , 2014, PLoS Comput. Biol..

[6]  Martin Weigt,et al.  How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins? , 2018, Molecular biology and evolution.

[7]  Lenore Cowen,et al.  MRFy: Remote Homology Detection for Beta-Structural Proteins Using Markov Random Fields and Stochastic Search , 2015, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[8]  Carlo Baldassi,et al.  Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners , 2014, PloS one.

[9]  J. Besag Statistical Analysis of Non-Lattice Data , 1975 .

[10]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[11]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[12]  Inken Wohlers,et al.  Exact algorithms for pairwise protein structure alignment , 2012 .

[13]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[14]  Jianzhu Ma,et al.  Protein structure alignment beyond spatial proximity , 2013, Scientific Reports.

[15]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[16]  J. M. Sauder,et al.  Large‐scale comparison of protein sequence alignment algorithms with structure alignments , 2000, Proteins.

[17]  Maria Jesus Martin,et al.  Uniclust databases of clustered and deeply annotated protein sequences and alignments , 2016, Nucleic Acids Res..

[18]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[19]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[20]  A. Tramontano,et al.  New encouraging developments in contact prediction: Assessment of the CASP11 results , 2016, Proteins.

[21]  Andreas Prlic,et al.  SISYPHUS—structural alignments for proteins with non-trivial relationships , 2006, Nucleic Acids Res..

[22]  Hugo Talibart,et al.  Using residues coevolution to search for protein homologs through alignment of Potts models , 2019 .

[23]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[24]  W. P. Russ,et al.  Evolutionary information for specifying a protein fold , 2005, Nature.

[25]  Rumen Andonov,et al.  Maximum Contact Map Overlap Revisited , 2011, J. Comput. Biol..

[26]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.

[27]  R. Lathrop The protein threading problem with sequence amino acid interaction preferences is NP-complete. , 1994, Protein engineering.

[28]  Simona Cocco,et al.  ACE: adaptive cluster expansion for maximum entropy graphical model inference , 2016, bioRxiv.

[29]  Susann Vorberg,et al.  Bayesian statistical approach for protein residue-residue contact prediction , 2017 .

[30]  Milot Mirdita,et al.  HH-suite3 for fast remote homology detection and deep protein annotation , 2019, BMC Bioinformatics.

[31]  Lenore Cowen,et al.  Markov random fields reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system , 2010, Proceedings of the National Academy of Sciences.