Graphical Models of Residue Coupling in Protein Families

Many statistical measures and algorithmic techniques have been proposed for studying residue coupling in protein families. Generally speaking, two residue positions are considered coupled if, in the sequence record, some of their amino acid type combinations are significantly more common than others. While the proposed approaches have proven useful in finding and describing coupling, a significant missing component is a formal probabilistic model that explicates and compactly represents the coupling, integrates information about sequence, structure, and function, and supports inferential procedures for analysis, diagnosis, and prediction. We present an approach to learning and using probabilistic graphical models of residue coupling (GMRCs). These models capture significant conservation and coupling constraints observable in a multiply aligned set of sequences. Our approach can place a structural prior on considered couplings, so that all identified relationships have direct mechanistic explanations. It can also incorporate information about functional classes, and thereby learn a differential graphical model that distinguishes constraints common to all classes from those unique to individual classes. Such differential models separately account for class-specific conservation and family-wide coupling, two different sources of sequence covariation. They are then able to perform interpretable functional classification of new sequences, explaining classification decisions in terms of the underlying conservation and coupling constraints. We apply our approach in studying both G protein-coupled receptors and PDZ domains, identifying and analyzing family-wide and class-specific constraints, and performing functional classification. The results demonstrate that GMRCs provide a powerful tool for uncovering, representing, and utilizing significant sequence-structure-function relationships in protein families.
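The informal definition of coupling given above (some amino acid combinations at two positions occurring more often than expected) can be illustrated with mutual information between two alignment columns. This is a minimal sketch of one generic covariation statistic, not the GMRC learning procedure itself. The function name `mutual_information` and the toy alignment are hypothetical, and the sketch assumes an ungapped alignment and omits the significance testing a real analysis would need.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    msa is a list of equal-length sequences (strings). Probabilities are
    plain empirical frequencies; no pseudocounts or sequence weighting.
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        ratio = p_ab * n * n / (counts_i[a] * counts_j[b])
        mi += p_ab * log2(ratio)
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# while column 2 varies independently of column 0.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]

coupled = mutual_information(toy_msa, 0, 1)
uncoupled = mutual_information(toy_msa, 0, 2)
print(f"MI(0,1) = {coupled:.3f} bits, MI(0,2) = {uncoupled:.3f} bits")
```

In the toy alignment, columns 0 and 1 score one full bit because knowing one residue determines the other, while columns 0 and 2 score zero. A pairwise score like this only flags covariation; it does not by itself separate conservation from coupling or distinguish direct from indirect relationships, which is the role the abstract assigns to a probabilistic graphical model.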


[51]  C. Bailey-Kellogg, et al. Graphical Models of Residue Coupling in Protein Families, 2008, TCBB.
