Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable

Background

Using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results

First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If a given sequence attains its lowest energy on one particular lattice conformation, we assume that the sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or none. We classify sequences as folding to either highly- or poorly-designable conformations. We randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.

Conclusion

Using these machine learning algorithms with ten-fold cross-validation, we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
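The enumerate-and-thread procedure described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the shape is a small triangle with 3 sites per edge (6 sites in total), the energy function is a simple contact energy of -1 for every non-bonded H-H neighbour pair (the abstract does not state which energy function was used), walks that share the same contact map are merged into one conformation, and all function names are hypothetical.

```python
from itertools import product

# Axial-coordinate offsets of the 6 neighbours on the 2D triangular lattice.
NEIGHBOURS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def triangle_sites(side):
    """Sites of a triangular shape with `side` lattice points along each edge."""
    return {(i, j) for i in range(side) for j in range(side - i)}


def compact_walks(sites):
    """Enumerate all directed self-avoiding walks that visit every site once."""
    walks = []

    def extend(path, visited):
        if len(path) == len(sites):
            walks.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in NEIGHBOURS:
            nxt = (x + dx, y + dy)
            if nxt in sites and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in sorted(sites):
        extend([start], {start})
    return walks


def contact_map(walk):
    """Non-bonded neighbour pairs, as chain-index pairs (i, j) with j > i + 1."""
    index = {site: k for k, site in enumerate(walk)}
    contacts = set()
    for (x, y), i in index.items():
        for dx, dy in NEIGHBOURS:
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1:
                contacts.add((i, j))
    return frozenset(contacts)


def designability(side):
    """Count, per conformation, the H/P sequences with that unique ground state."""
    sites = triangle_sites(side)
    n = len(sites)
    # Walks with identical contact maps cannot be told apart by a contact
    # energy (this includes walks related by a symmetry of the shape).
    conformations = sorted({contact_map(w) for w in compact_walks(sites)}, key=sorted)
    counts = [0] * len(conformations)
    for seq in product("HP", repeat=n):
        # Assumed energy function: -1 per non-bonded H-H contact.
        energies = [
            -sum(1 for i, j in c if seq[i] == "H" and seq[j] == "H")
            for c in conformations
        ]
        best = min(energies)
        if energies.count(best) == 1:  # sequence folds only if the minimum is unique
            counts[energies.index(best)] += 1
    return conformations, counts


if __name__ == "__main__":
    conformations, counts = designability(3)
    for conf, count in sorted(zip(conformations, counts), key=lambda t: -t[1]):
        print(count, sorted(conf))
```

Conformations with large counts would play the role of highly-designable structures and those with small or zero counts the poorly-designable ones. The sequences assigned to each group could then be encoded as binary feature vectors and passed to a classifier, which is the step the abstract evaluates with ten-fold cross-validation.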
