Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments

Predicting protein structure from primary sequence is one of the ultimate challenges in computational biology. Given the large amount of available sequence data, the analysis of co-evolution, i.e., statistical dependency, between columns in multiple alignments of protein domain sequences remains one of the most promising avenues for predicting which residues are in contact in the structure. A key impediment to this approach is that strong statistical dependencies are also observed for many residue pairs that are distal in the structure. Using a comprehensive analysis of protein domains with available three-dimensional structures, we show that co-evolving contacts very commonly form chains that percolate through the protein structure, inducing indirect statistical dependencies between many distal pairs of residues. We characterize the distributions of the length and the spatial distance traveled by these co-evolving contact chains, and we show that they explain a large fraction of the observed statistical dependencies between structurally distal pairs. We adapt a recently developed Bayesian network model into a rigorous procedure for disentangling direct from indirect statistical dependencies, and we demonstrate that this method not only accomplishes this task but also allows contacts with weak statistical dependency to be detected. To illustrate how additional information can be built into our method, we add a phylogenetic correction and develop an informative prior. The prior takes into account that the probability that a pair of residues is in contact depends strongly on their separation in the primary sequence and on the degree of conservation of the corresponding columns in the multiple alignment. We show that our model, including these extensions, dramatically improves the accuracy of contact prediction from multiple sequence alignments.
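The notion of an indirect statistical dependency can be illustrated with a small, hedged sketch. It is not the Bayesian network procedure described above: it only computes plain empirical mutual information between pairs of alignment columns. The function name `mutual_information` and the toy alignment `rows` are assumptions made for illustration. In the toy data, column 0 is coupled to column 1 and column 1 to column 2, while columns 0 and 2 are independent once column 1 is known, yet they still show a nonzero dependency.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Empirical mutual information (in bits) between two alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab * log2(p_ab / (p_a * p_b)), expressed with raw counts
        mi += (n_ab / n) * log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy alignment of 32 three-residue sequences forming a chain
# 0 - 1 - 2: column 1 tracks column 0 in 3 of every 4 sequences, and
# column 2 tracks column 1 in 3 of every 4 sequences, independently of column 0.
rows = (["ADE"] * 9 + ["ADR"] * 3 + ["AKR"] * 3 + ["AKE"] * 1 +
        ["LKR"] * 9 + ["LKE"] * 3 + ["LDE"] * 3 + ["LDR"] * 1)
columns = list(zip(*rows))  # transpose: one tuple of residues per column

mi_01 = mutual_information(columns[0], columns[1])  # direct coupling
mi_12 = mutual_information(columns[1], columns[2])  # direct coupling
mi_02 = mutual_information(columns[0], columns[2])  # induced only via column 1

print(f"MI(0,1) = {mi_01:.3f} bits")
print(f"MI(1,2) = {mi_12:.3f} bits")
print(f"MI(0,2) = {mi_02:.3f} bits")
```

In this toy case the dependency between columns 0 and 2 is weaker than either direct one but still positive, so a ranking based on pairwise scores alone could still list the pair 0 and 2 as a candidate contact. This sketch ignores finite-sample bias, phylogenetic correlation between sequences, and gaps, all of which matter on real alignments.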
