Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners

In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.

[1]  William R. Taylor,et al.  Direct correlation analysis improves fold recognition , 2011, Comput. Biol. Chem..

[2]  E. Neher How frequent are correlated changes in families of protein sequences? , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[3]  W. Bialek,et al.  Maximum entropy models for antibody diversity , 2009, Proceedings of the National Academy of Sciences.

[4]  Massimiliano Pontil,et al.  PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments , 2012, Bioinform..

[5]  B. Lunt,et al.  Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks , 2011, PloS one.

[6]  Erik van Nimwegen,et al.  Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments , 2010, PLoS Comput. Biol..

[7]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[8]  Jerry Nedelman,et al.  Book review: “Bayesian Data Analysis,” Second Edition by A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin Chapman & Hall/CRC, 2004 , 2005, Comput. Stat..

[9]  N. Stanietsky,et al.  The interaction of TIGIT with PVR and PVRL2 inhibits human NK cell cytotoxicity , 2009, Proceedings of the National Academy of Sciences.

[10]  David T. Jones,et al.  Protein topology from predicted residue contacts , 2012, Protein science : a publication of the Protein Society.

[11]  Timothy Nugent,et al.  Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis , 2012, Proceedings of the National Academy of Sciences.

[12]  Gregory B. Gloor,et al.  Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction , 2008, Bioinform..

[13]  Martin Weigt,et al.  Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis , 2012, Proceedings of the National Academy of Sciences.

[14]  Christopher Jarzynski,et al.  Using Sequence Alignments to Predict Protein Structure and Stability With High Accuracy , 2012, 1207.2484.

[15]  E. Jaynes Information Theory and Statistical Mechanics , 1957 .

[16]  C. Sander,et al.  Correlated mutations and residue contacts in proteins , 1994, Proteins.

[17]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[18]  J. Hoch,et al.  Keeping Signals Straight in Phosphorelay Signal Transduction , 2001, Journal of bacteriology.

[19]  M. Laub,et al.  Specificity in two-component signal transduction pathways. , 2007, Annual review of genetics.

[20]  A. Lesk,et al.  Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. , 1987, Journal of molecular biology.

[21]  D. Baker,et al.  Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era , 2013, Proceedings of the National Academy of Sciences.

[22]  M. Y. Lobanov,et al.  To be folded or to be unfolded? , 2004, Protein science : a publication of the Protein Society.

[23]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[24]  R. Aldrich,et al.  Influence of conservation on calculations of amino acid covariance in multiple sequence alignments , 2004, Proteins.

[25]  Cajo J. F. ter Braak,et al.  Correlated mutations via regularized multinomial regression , 2011, BMC Bioinformatics.

[26]  G. Stormo,et al.  Correlated mutations in models of protein sequences: phylogenetic and structural effects , 1999 .

[27]  R. Bourret,et al.  Two-component signal transduction. , 2010, Current opinion in microbiology.

[28]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.

[29]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[30]  R. Ranganathan,et al.  Evolutionarily conserved pathways of energetic connectivity in protein families. , 1999, Science.

[31]  Michael T Laub,et al.  Two-Component Signal Transduction Pathways Regulating Growth and Cell Cycle Progression in a Bacterium: A System-Level Analysis , 2005, PLoS biology.

[32]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[33]  Sivaraman Balakrishnan,et al.  Learning generative models for protein fold families , 2011, Proteins.

[34]  E. van Nimwegen,et al.  Accurate Prediction of Protein–protein Interactions from Sequence Alignments Using a Bayesian Method , 2022 .

[35]  Xiaozheng Xu,et al.  Mechanistic Insights Revealed by the Crystal Structure of a Histidine Kinase with Signal Transducer and Sensor Domains , 2013, PLoS biology.

[36]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[37]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[38]  A. Valencia,et al.  Emerging methods in protein co-evolution , 2013, Nature Reviews Genetics.

[39]  Hendrik Szurmant,et al.  Interaction fidelity in two-component signaling. , 2010, Current opinion in microbiology.

[40]  Thomas A. Hopf,et al.  Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing , 2012, Cell.

[41]  John K Kruschke,et al.  Bayesian data analysis. , 2010, Wiley interdisciplinary reviews. Cognitive science.

[42]  C. Sander,et al.  Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? , 1994, Protein engineering.

[43]  J. Hoch,et al.  Multiple histidine kinases regulate entry into stationary phase and sporulation in Bacillus subtilis , 2000, Molecular microbiology.

[44]  F. Morcos,et al.  Genomics-aided structure prediction , 2012, Proceedings of the National Academy of Sciences.

[45]  Andreas Möglich,et al.  Full-length structure of a sensor histidine kinase pinpoints coaxial coiled coils as signal transducers and modulators. , 2013, Structure.

[46]  Simona Cocco,et al.  From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction , 2012, PLoS Comput. Biol..

[47]  Terence Hwa,et al.  High-resolution protein complexes from integrating genomic information with molecular simulation , 2009, Proceedings of the National Academy of Sciences.

[48]  A. Newton,et al.  The Core Dimerization Domains of Histidine Kinases Contain Recognition Specificity for the Cognate Response Regulator , 2003, Journal of bacteriology.