Support vector machine based classification of 3-dimensional protein physicochemical environments for automated function annotation

Knowledge of protein function and structure is critical for drug discovery and development. The FEATURE system developed at Stanford is an effective tool for characterizing and classifying local environments in proteins. FEATURE represents the physicochemical properties around a residue as a vector of fixed dimension, and it identifies functional sites and non-sites by classifying these vectors with a Naïve Bayes classifier. In this paper, we improve the FEATURE framework in several ways to make it more flexible, robust and accurate. The new tool handles vectors of a user-specified dimension and uses dimensionality reduction to suppress noise effectively with little loss of important signal. It also replaces the Naïve Bayes classifier with a support vector machine for more accurate classification. In our experiments, the proposed approach outperformed the original tool by 20.13% and 13.42% with respect to true and false positive rates, respectively.
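The pipeline described above (environment vectors, dimensionality reduction, SVM classification, evaluation by true and false positive rates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not name the reduction method or the SVM kernel, so the sketch uses PCA by power iteration and a linear SVM trained by subgradient descent on the hinge loss. The data are synthetic stand-ins rather than FEATURE output, and all function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_data(n, rng, dim=10):
    """Synthetic stand-in for environment vectors (NOT real FEATURE output)."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1          # +1 = site, -1 = non-site
        row = [rng.gauss(0.0, 0.3) for _ in range(dim)]  # low-variance noise
        row[0] = rng.gauss(3.0 * label, 1.0)     # only two properties
        row[1] = rng.gauss(3.0 * label, 1.0)     # carry class signal
        X.append(row)
        y.append(label)
    return X, y

def fit_pca(X, k, iters=100, seed=0):
    """Top-k principal directions via power iteration with deflation."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(d)]
    Xc = [[r[j] - mu[j] for j in range(d)] for r in X]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            scores = [dot(r, v) for r in Xc]
            # w is proportional to (covariance matrix) @ v
            w = [sum(s * r[j] for s, r in zip(scores, Xc)) for j in range(d)]
            for c in comps:                      # remove directions already found
                p = dot(w, c)
                w = [a - p * b for a, b in zip(w, c)]
            norm = dot(w, w) ** 0.5
            v = [a / norm for a in w]
        comps.append(v)
    return mu, comps

def project(X, mu, comps):
    """Center with the training mean and project onto the components."""
    return [[dot([a - m for a, m in zip(r, mu)], c) for c in comps] for r in X]

def train_linear_svm(Z, y, lam=0.01, eta=0.01, epochs=30, seed=0):
    """Linear SVM: stochastic subgradient descent on L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(Z[0]), 0.0
    order = list(range(len(Z)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * (dot(w, Z[i]) + b) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]   # L2 shrinkage
            if violated:                                # hinge-loss subgradient
                w = [wj + eta * y[i] * zj for wj, zj in zip(w, Z[i])]
                b += eta * y[i]
    return w, b

def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate); needs both classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return tp / (tp + fn), fp / (fp + tn)

def run_demo(seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_data(200, rng)
    X_test, y_test = make_data(100, rng)
    mu, comps = fit_pca(X_train, k=2)            # reduce 10 dimensions to 2
    w, b = train_linear_svm(project(X_train, mu, comps), y_train)
    preds = [1 if dot(w, z) + b >= 0 else -1
             for z in project(X_test, mu, comps)]
    return rates(y_test, preds)

tpr, fpr = run_demo()
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

The reduction step is fitted on the training vectors only and then reused for the held-out vectors, so the reported rates are not inflated by information from the test set. A kernel SVM or a different reduction method could be substituted without changing the overall structure.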
