PPalign: optimal alignment of Potts models representing proteins with direct coupling information

Background To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use.
Methods We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark. These alignments have the lowest sequence identity (between 3% and 20%) and allow reliable Potts models to be built for each sequence to be aligned. This experiment confirms that Potts models can be aligned in reasonable time (1 min 37 s on average on these alignments). The contribution of couplings is evaluated by comparison with HHalign and with independent-site PPalign. Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean $F_1$ score and, in some cases, finds significantly better alignments than HHalign and PPalign without couplings.
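
To illustrate why couplings make the alignment problem non-local, the following minimal sketch scores one fixed alignment between two toy Potts models. It is not PPalign's implementation: the abstract does not specify the scoring function, so the dot-product similarity between fields and between couplings is an assumption, gap costs are left out, and every name (`dot`, `alignment_score`, the toy parameters) is hypothetical. The sketch only evaluates a given alignment; finding the best one is the task the Integer Linear Programming formulation addresses.

```python
from itertools import combinations


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def alignment_score(fields_a, couplings_a, fields_b, couplings_b,
                    aligned_pairs, use_couplings=True):
    """Score a fixed alignment between two toy Potts models.

    fields_x[i]         : field values at position i (one per letter)
    couplings_x[(i, j)] : flattened coupling values for positions i < j
    aligned_pairs       : (i, k) pairs, strictly increasing in i and in k

    Assumption: similarity is a dot product; gap costs are ignored.
    """
    pairs = sorted(aligned_pairs)
    # Local term: each aligned position contributes independently.
    score = sum(dot(fields_a[i], fields_b[k]) for i, k in pairs)
    if use_couplings:
        # Non-local term: each PAIR of aligned positions contributes,
        # so the value of aligning i with k depends on every other choice.
        for (i, k), (j, l) in combinations(pairs, 2):
            score += dot(couplings_a[(i, j)], couplings_b[(k, l)])
    return score


# Toy models: two positions each, two-letter alphabet (hypothetical values).
fields_a = [[1.0, 0.0], [0.0, 1.0]]
fields_b = [[1.0, 0.0], [0.0, 2.0]]
couplings_a = {(0, 1): [0.5, 0.0, 0.0, 0.5]}
couplings_b = {(0, 1): [1.0, 0.0, 0.0, 1.0]}
alignment = [(0, 0), (1, 1)]

with_couplings = alignment_score(fields_a, couplings_a,
                                 fields_b, couplings_b, alignment)
fields_only = alignment_score(fields_a, couplings_a,
                              fields_b, couplings_b, alignment,
                              use_couplings=False)
print(with_couplings, fields_only)
```

With `use_couplings=False` the score decomposes position by position, which is the independent-site setting; the pairwise loop is what removes that decomposition.
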
Conclusions These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nonetheless suggest that new research on the inference of Potts models is now needed to make them more comparable and better suited to homology search. We think that PPalign's guaranteed optimality will be a powerful asset for unbiased investigations in this direction.
