Inferring interaction partners from protein sequences using mutual information

Specific protein-protein interactions are crucial in most cellular processes. They enable multiprotein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multiprotein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are specific interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. This stands in contrast with structure prediction of proteins and of multiprotein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings imply that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.

Author summary

Specific protein-protein interactions are at the heart of most intracellular processes. Mapping these interactions is thus crucial to a systems-level understanding of cells, and has broad applications to areas such as drug targeting. Systematic experimental identification of protein interaction partners is still challenging. However, a large and rapidly growing amount of sequence data is now available. Recently, algorithms have been proposed to identify which proteins interact from their sequences alone, thanks to the co-variation of the sequences of interacting proteins. These algorithms build upon inference methods that have been used with success to predict the three-dimensional structures of proteins and multiprotein complexes, and their focus is on the amino-acid residues that are in direct contact. Here, we propose a simpler method to identify which proteins interact among the paralogous proteins of two families, starting from their sequences alone. Our method relies on an approximate maximization of mutual information between the sequences of the two families, without specifically emphasizing the contacting residue pairs. We demonstrate that this method slightly outperforms these earlier algorithms. This result highlights that partner prediction does not rely only on the identities and interactions of directly contacting amino acids.
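
The idea of pairing paralogs by maximizing mutual information can be illustrated with a toy sketch. It is not the method of the paper: it assumes a plain plug-in estimate of mutual information with no pseudocounts, and it maximizes by brute force over all pairings within one species rather than by the approximate maximization described above. The function names (`column_mi`, `total_mi`, `best_pairing`) and the example sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def column_mi(col_a, col_b):
    """Plug-in mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def total_mi(seqs_a, seqs_b):
    """Sum of column_mi over every (column of A, column of B) pair.

    seqs_a[i] and seqs_b[i] are assumed to be paired, aligned sequences.
    """
    cols_a = list(zip(*seqs_a))
    cols_b = list(zip(*seqs_b))
    return sum(column_mi(ca, cb) for ca in cols_a for cb in cols_b)


def best_pairing(known_a, known_b, species_a, species_b):
    """Try every pairing of one species' paralogs and keep the one that
    maximizes total_mi of the known pairs plus the candidate pairs.

    Returns (perm, score), where species_a[i] is paired with
    species_b[perm[i]]. Brute force is only viable for a handful of paralogs.
    """
    best_perm, best_score = None, -math.inf
    for perm in permutations(range(len(species_b))):
        seqs_a = known_a + species_a
        seqs_b = known_b + [species_b[j] for j in perm]
        score = total_mi(seqs_a, seqs_b)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy data (hypothetical): known partners covary at the first position.
known_a = ["AL", "CL", "DL"]
known_b = ["KG", "EG", "RG"]
# Unpaired paralogs from one species, listed in arbitrary order.
pairing, score = best_pairing(known_a, known_b, ["AL", "CL"], ["EG", "KG"])
print(pairing)  # (1, 0): "AL" pairs with "KG", "CL" pairs with "EG"
```

With more than a few paralogs per species, enumerating permutations becomes infeasible. One common alternative is to score each candidate pair separately and solve the resulting assignment problem, for example with `scipy.optimize.linear_sum_assignment`.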
