Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble

Background: Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated.

Results: In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction.

Conclusions: The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.
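For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how the Matthews correlation coefficient is commonly computed for a binary binding/non-binding residue labelling. It uses the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the function names `confusion_counts` and `mcc`, the 0/1 label encoding, and the convention of returning 0.0 when the denominator is zero are assumptions made for illustration and are not taken from the TargetVita implementation.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (an assumed convention
    for the degenerate case where a row or column of the matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical per-residue labels for a short sequence.
    y_true = [1, 0, 0, 1, 0, 0, 0, 1]
    y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
    print(mcc(*confusion_counts(y_true, y_pred)))
```

MCC is a natural choice for binding-residue prediction because binding residues are typically a small minority of all residues, and MCC accounts for all four cells of the confusion matrix rather than rewarding a predictor for labelling everything as non-binding.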
