Automatic classification of protein structures using physicochemical parameters

Protein classification is the first step toward functional annotation; the SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the gap between the number of three-dimensional (3D) protein structures generated and the number classified into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting the function of novel proteins from sequence information alone has proven to be a major challenge.

The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family using sequence-derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure, was used as a benchmark against which to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied.

Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a CA above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved CA for both the physicochemical-parameter-based and spectrophore-based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with CA ranging from 90% to 96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
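
As an illustration of the information-gain feature selection mentioned above, the following is a minimal sketch in plain Python, not the study's implementation. It assumes each numeric descriptor is binarized at its median before the gain is computed; the feature names, the toy values and the helper functions (`entropy`, `information_gain`, `rank_features`) are hypothetical and chosen only for the example.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels):
    """Information gain of one numeric feature, binarized at its median.

    The median split is an illustrative assumption, not the paper's method.
    """
    cut = median(values)
    low = [y for x, y in zip(values, labels) if x <= cut]
    high = [y for x, y in zip(values, labels) if x > cut]
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical descriptor values for six sequences in two classes.
names = ["hydrophobicity", "charge", "length"]
rows = [
    [0.9, 1.0, 120],
    [0.8, -1.0, 300],
    [0.7, 1.0, 150],
    [0.2, -1.0, 310],
    [0.1, 1.0, 140],
    [0.3, -1.0, 290],
]
labels = ["A", "A", "A", "B", "B", "B"]

ranking = rank_features(rows, labels, names)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy set the hypothetical `hydrophobicity` column separates the two classes perfectly, so it ranks first; a classifier would then be trained on only the top-ranked features for each family.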
