Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups that fill various functional niches. At the sequence level, such divergence appears as correlations arising from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations, one may hope to obtain information regarding functionally relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes’ theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high-dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this approach reveals sequence and structural features indicative of functionally important, yet generally unknown, biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, it can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).
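
The statement that subgroup-specific residue patterns appear as correlations at the sequence level can be illustrated with a toy calculation. The sketch below is not the hiHMM or MCMC procedure described above; it substitutes plain mutual information between two alignment columns as a simple stand-in for "correlation". The alignment, the subgroup labels, and the function name `column_mutual_information` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(seqs, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    Assumes `seqs` is a list of equal-length, already aligned sequences.
    This is a simple illustrative statistic, not the hiHMM model.
    """
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)           # residue counts in column i
    count_j = Counter(s[j] for s in seqs)           # residue counts in column j
    count_ij = Counter((s[i], s[j]) for s in seqs)  # joint residue-pair counts

    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


# Hypothetical toy alignment: two subgroups that differ at columns 1 and 2.
toy_alignment = [
    "GRKAE", "GRKAD", "GRKAE",  # made-up subgroup 1 (R, K at columns 1, 2)
    "GDWAE", "GDWAD", "GDWAE",  # made-up subgroup 2 (D, W at columns 1, 2)
]

mi_subgroup_columns = column_mutual_information(toy_alignment, 1, 2)
mi_unrelated_columns = column_mutual_information(toy_alignment, 1, 4)
```

In this toy case, columns 1 and 2 vary together with subgroup membership and so share information, whereas column 4 varies independently of the subgroup split. A pairwise statistic of this kind only hints at the idea; the approach in the abstract models such correlations implicitly through a hierarchy of hidden Markov models.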

[1]  Andrew F Neuwald,et al.  The hallmark of AGC kinase functional divergence is its C-terminal tail, a cis-acting regulatory module , 2007, Proceedings of the National Academy of Sciences.

[2]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[3]  J. Ioannidis Why Most Published Research Findings Are False , 2005, PLoS medicine.

[4]  Mario Vazdar,et al.  Like-charge guanidinium pairing from molecular dynamics and ab initio calculations. , 2011, The journal of physical chemistry. A.

[5]  Terence Hwa,et al.  Direct coupling analysis for protein contact prediction. , 2014, Methods in molecular biology.

[6]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[7]  Andrew F Neuwald,et al.  Protein domain hierarchy Gibbs sampling strategies , 2014, Statistical applications in genetics and molecular biology.

[8]  István Miklós,et al.  StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees , 2008, Bioinform..

[9]  C. Sander,et al.  All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences , 2015, Proceedings of the National Academy of Sciences.

[10]  L. Mirny,et al.  Using orthologous and paralogous proteins to identify specificity determining residues , 2002, Genome Biology.

[11]  A. F. Neuwald,et al.  Did protein kinase regulatory mechanisms evolve through elaboration of a simple structural component? , 2005, Journal of molecular biology.

[12]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.

[13]  David Haussler,et al.  Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families , 1993, ISMB.

[14]  G J Barton,et al.  Identification of functional residues and secondary structure from protein multiple sequence alignment. , 1996, Methods in enzymology.

[15]  O. Lichtarge,et al.  A family of evolution-entropy hybrid methods for ranking protein residues by importance. , 2004, Journal of molecular biology.

[16]  Jun S. Liu,et al.  Ran's C-terminal, basic patch, and nucleotide exchange mechanisms in light of a canonical structure for Rab, Rho, Ras, and Ran GTPases. , 2003, Genome research.

[17]  Mona Singh,et al.  Characterization and prediction of residues determining protein functional specificity , 2008, Bioinform..

[18]  István Miklós,et al.  Bayesian coestimation of phylogeny and sequence alignment , 2005, BMC Bioinformatics.

[19]  Jotun Hein,et al.  Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. , 2014, Molecular biology and evolution.

[20]  J M Kates Optimal estimation of hearing-aid compression parameters. , 1993, The Journal of the Acoustical Society of America.

[21]  Stephen F. Altschul,et al.  Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties , 2016, PLoS Comput. Biol..

[22]  Jochen Bauer,et al.  H2rs: Deducing evolutionary and functionally important residue positions by means of an entropy and similarity based analysis of multiple sequence alignments , 2014, BMC Bioinformatics.

[23]  M. Gelfand,et al.  Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families , 2004, Protein science : a publication of the Protein Society.

[24]  Andrew F Neuwald,et al.  Bayesian shadows of molecular mechanisms cast in the light of evolution. , 2006, Trends in biochemical sciences.

[25]  F. Cohen,et al.  An evolutionary trace method defines binding surfaces common to protein families. , 1996, Journal of molecular biology.

[26]  Andrew F Neuwald Surveying the Manifold Divergence of an Entire Protein Class for Statistical Clues to Underlying Biochemical Mechanisms , 2011, Statistical applications in genetics and molecular biology.

[27]  Johannes Söding,et al.  Prediction of protein functional residues from sequence by probability density estimation , 2008, Bioinform..

[28]  Andrew F. Neuwald,et al.  Rapid detection, classification and accurate alignment of up to a million or more related protein sequences , 2009, Bioinform..

[29]  Thomas A. Hopf,et al.  Protein structure prediction from sequence variation , 2012, Nature Biotechnology.

[30]  Predrag Radivojac,et al.  The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective , 2014, Bioinform..

[31]  Jordan L. Boyd-Graber,et al.  Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space , 2013, J. Comput. Biol..

[32]  Thomas A. Hopf,et al.  Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing , 2012, Cell.

[33]  R. Tata,et al.  l-Methionine sulfoximine, but not phosphinothricin, is a substrate for an acetyltransferase (gene PA4866) from Pseudomonas aeruginosa: structural and functional studies. , 2007, Biochemistry.

[34]  M. Lanotte,et al.  Identification and characterization of the human ARD1-NATH protein acetyltransferase complex. , 2005, The Biochemical journal.

[35]  R. Russell,et al.  Analysis and prediction of functional sub-types from protein sequence alignments. , 2000, Journal of molecular biology.

[36]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[37]  Kristy L. Hentchel,et al.  In Salmonella enterica, the Gcn5-Related Acetyltransferase MddA (Formerly YncA) Acetylates Methionine Sulfoximine and Methionine Sulfone, Blocking Their Toxic Effects , 2014, Journal of bacteriology.

[38]  Jun S. Liu,et al.  Monte Carlo strategies in scientific computing , 2001 .

[39]  Debora S. Marks,et al.  Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models , 2015, PLoS Comput. Biol..

[40]  M. Suchard,et al.  Joint Bayesian estimation of alignment and phylogeny. , 2005, Systematic biology.

[41]  Haruki Nakamura,et al.  Data Deposition and Annotation at the Worldwide Protein Data Bank , 2009, Molecular biotechnology.

[42]  Jun S. Liu,et al.  Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model , 2004, BMC Bioinformatics.

[43]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[44]  Thomas A. Hopf,et al.  Sequence co-evolution gives 3D contacts and structures of protein complexes , 2014, eLife.

[45]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[46]  Andrew F. Neuwald Evaluating, Comparing, and Interpreting Protein Domain Hierarchies , 2014, J. Comput. Biol..

[47]  L. Mirny,et al.  Using evolutionary information to find specificity-determining and co-evolving residues. , 2009, Methods in molecular biology.

[48]  K. Katoh,et al.  MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability , 2013, Molecular biology and evolution.

[49]  Christophe Dessimoz,et al.  CAFA and the open world of protein function predictions. , 2013, Trends in genetics : TIG.

[50]  Andrew F. Neuwald,et al.  Evolutionary clues to DNA polymerase III β clamp structural mechanisms , 2003 .

[51]  Helen Attrill,et al.  Structural and biochemical characterization of a trapped coenzyme A adduct of Caenorhabditis elegans glucosamine-6-phosphate N-acetyltransferase 1 , 2012, Acta crystallographica. Section D, Biological crystallography.

[52]  R. Ranganathan,et al.  Evolutionarily conserved pathways of energetic connectivity in protein families. , 1999, Science.

[53]  Xugang Ye,et al.  On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison , 2011, J. Comput. Biol..

[54]  D. Higgins,et al.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega , 2011, Molecular systems biology.

[55]  Andrew F. Neuwald,et al.  A Bayesian Sampler for Optimization of Protein Domain Hierarchies , 2014, J. Comput. Biol..

[56]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[57]  Rainer Merkl,et al.  CLIPS-1D: analysis of multiple sequence alignments to deduce for residue-positions a role in catalysis, ligand-binding, or protein structure , 2011, BMC Bioinformatics.

[58]  Kazutaka Katoh,et al.  MAFFT: iterative refinement and additional methods. , 2014, Methods in molecular biology.

[59]  Jun S. Liu,et al.  Markovian structures in biological sequence alignments , 1999 .

[60]  Fabian Sievers,et al.  Clustal Omega, accurate alignment of very large numbers of sequences. , 2014, Methods in molecular biology.

[61]  Benjamin D. Redelings,et al.  BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny , 2006, Bioinform..

[62]  Andrew F Neuwald,et al.  The glycine brace: a component of Rab, Rho, and Ran GTPases associated with hinge regions of guanine- and phosphate-binding loops , 2009, BMC Structural Biology.

[63]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[64]  Kimmen Sjölander,et al.  INTREPID—INformation-theoretic TREe traversal for Protein functional site IDentification , 2008, Bioinform..

[65]  Predrag Radivojac,et al.  Computational methods for identification of functional residues in protein structures. , 2011, Current protein & peptide science.

[66]  P. Grünwald The Minimum Description Length Principle (Adaptive Computation and Machine Learning) , 2007 .

[67]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[68]  J. Heringa,et al.  Sequence comparison by sequence harmony identifies subtype-specific functional sites , 2006, Nucleic acids research.

[69]  M. Parliament,et al.  Radiogenomics: associations in all the wrong places? , 2012, The Lancet. Oncology.

[70]  Jukka Corander,et al.  Bayesian search of functionally divergent protein subgroups and their function specific residues , 2006, Bioinform..

[71]  Jirí Vondrásek,et al.  The molecular origin of like-charge arginine-arginine pairing in water. , 2009, The journal of physical chemistry. B.

[72]  Kai Ye,et al.  Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting , 2008, Bioinform..

[73]  Andrew F. Neuwald,et al.  Gα–Gβγ dissociation may be due to retraction of a buried lysine and disruption of an aromatic cluster by a GTP‐sensing Arg–Trp pair , 2007 .

[74]  Anna R. Panchenko,et al.  Ensemble approach to predict specificity determinants: benchmarking and validation , 2009, BMC Bioinformatics.

[75]  C. Sander,et al.  A method to predict functional residues in proteins , 1995, Nature Structural Biology.

[76]  Andrew F Neuwald,et al.  Evolutionary constraints associated with functional specificity of the CMGC protein kinases MAPK, CDK, GSK, SRPK, DYRK, and CK2α , 2004, Protein science : a publication of the Protein Society.

[77]  John R. Davidson,et al.  SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction , 2010, Nucleic Acids Res..

[78]  Angela D. Wilkins,et al.  Evolutionary trace for prediction and redesign of protein functional sites. , 2012, Methods in molecular biology.

[79]  John C. Wootton,et al.  The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment , 2010, PLoS Comput. Biol..

[80]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[81]  Andrew F Neuwald,et al.  The charge-dipole pocket: a defining feature of signaling pathway GTPase on/off switches. , 2009, Journal of molecular biology.

[82]  Xun Gu,et al.  Predicting functional divergence in protein evolution by site-specific rate shifts. , 2002, Trends in biochemical sciences.

[83]  G. Mendel Versuche über Pflanzen-Hybriden , 1941, Der Zauchter Zeitschrift fur Theoretische und Angewandte Genetik.

[84]  Christopher J. Lanczycki,et al.  Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures , 2012, BMC Bioinformatics.

[85]  Ilan Vardi,et al.  Computational recreations in Mathematica , 1991 .

[86]  Abhijit R. Tendulkar,et al.  Electrostatics‐defying interaction between arginine termini as a thermodynamic driving force in protein–protein interaction , 2009, Proteins.

[87]  Zihe Rao,et al.  Crystal structure of tabtoxin resistance protein complexed with acetyl coenzyme A reveals the mechanism for beta-lactam acetylation. , 2003, Journal of molecular biology.

[88]  Robert B. Russell,et al.  Combining specificity determining and conserved residues improves functional site prediction , 2009, BMC Bioinformatics.

[89]  T. Koshy Catalan Numbers with Applications , 2008 .

[90]  Najeeb M. Halabi,et al.  Protein Sectors: Evolutionary Units of Three-Dimensional Structure , 2009, Cell.

[91]  Serita M. Nelesen,et al.  SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. , 2012, Systematic biology.

[92]  Andrew F. Neuwald,et al.  Identification and classification of small molecule kinases: insights into substrate recognition and specificity , 2016, BMC Evolutionary Biology.

[93]  Abhijit Chakraborty,et al.  A survey on prediction of specificity-determining sites in proteins , 2015, Briefings Bioinform..

[94]  Matthew W Vetting,et al.  Mechanistic and structural analysis of human spermidine/spermine N1-acetyltransferase. , 2007, Biochemistry.