Probabilistic grammatical model for helix‐helix contact site classification

Background: Hidden Markov Models power many state‐of‐the‐art tools in protein bioinformatics. While these methods excel at their tasks, they do not directly convey information on medium‐ and long‐range residue‐residue interactions. Capturing such interactions requires at least the expressive power of context‐free grammars. However, the application of more powerful grammar formalisms to protein analysis has been surprisingly limited.

Results: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to the classification of transmembrane helix‐helix pair configurations. The core of the model is a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence‐based descriptors of four classes of transmembrane helix‐helix contact site configurations. The best classifier reached an AUC ROC of 0.70. Analysis of the grammar parse trees revealed that they can represent structural features of helix‐helix contact sites.

Conclusions: We demonstrated that our probabilistic context‐free framework for the analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily modeling long‐range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable, so they could provide biologically meaningful information for molecular biologists.
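
To illustrate how a probabilistic context‐free grammar can assign a score to a sequence, the following is a minimal sketch of the standard inside (CYK‐style) algorithm for a grammar in Chomsky normal form. The grammar, its nonterminals (`S`, `A`, `B`), its rule probabilities, and the function name `inside_probability` are all hypothetical and chosen only for illustration; they are not the rules inferred in this work, and the genetic‐algorithm inference step is not shown.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (NOT the paper's grammar).
# Lexical rules: (nonterminal, amino-acid letter) -> probability.
LEXICAL = {
    ("A", "G"): 0.6, ("A", "A"): 0.4,  # assumed class: small residues
    ("B", "L"): 0.5, ("B", "I"): 0.5,  # assumed class: bulky hydrophobic residues
}
# Binary rules: (parent, left child, right child) -> probability.
BINARY = {
    ("S", "A", "S"): 0.3,
    ("S", "B", "S"): 0.3,
    ("S", "A", "B"): 0.2,
    ("S", "B", "A"): 0.2,
}


def inside_probability(seq, start="S"):
    """Return the total probability that `start` derives `seq`,
    summed over all parse trees (inside algorithm)."""
    n = len(seq)
    # chart[(i, j, X)] = probability that nonterminal X derives seq[i:j]
    chart = defaultdict(float)

    # Spans of length 1 are filled from the lexical rules.
    for i, symbol in enumerate(seq):
        for (nonterminal, terminal), p in LEXICAL.items():
            if terminal == symbol:
                chart[(i, i + 1, nonterminal)] += p

    # Longer spans combine two adjacent sub-spans via a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    p_left = chart.get((i, k, left), 0.0)
                    p_right = chart.get((k, j, right), 0.0)
                    if p_left and p_right:
                        chart[(i, j, parent)] += p * p_left * p_right

    return chart.get((0, n, start), 0.0)


if __name__ == "__main__":
    for s in ["GL", "GGL", "GW"]:
        print(s, inside_probability(s))
```

In a classification setting, one plausible use of such a score is to compare the probabilities that class‐specific grammars assign to the same sequence; how scores are normalized and thresholded in the actual framework is not specified in this abstract.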
