Sequence-Based Prediction of Protein-Carbohydrate Binding Sites Using Support Vector Machines

Carbohydrate-binding proteins play significant roles in many diseases, including cancer. Here, we established a machine-learning-based method, sequence-based prediction of residue-level interaction sites of carbohydrates (SPRINT-CBH), to predict carbohydrate-binding sites in proteins using support vector machines (SVMs). We found that integrating evolution-derived sequence profiles with additional information on sequence and predicted solvent accessible surface area leads to a reasonably accurate, robust, and predictive method, with an area under the receiver operating characteristic curve (AUC) of 0.78 and 0.77 and a Matthews correlation coefficient of 0.34 and 0.29 for 10-fold cross-validation and the independent test, respectively, without balancing binding and nonbinding residues. The quality of the method is further demonstrated by statistically significantly more binding residues being predicted for carbohydrate-binding proteins than for presumptive nonbinding proteins in the human proteome, and by the bias of rare alleles toward predicted carbohydrate-binding sites for nonsynonymous mutations from the 1000 Genomes Project. SPRINT-CBH is available as an online server at http://sparks-lab.org/server/SPRINT-CBH.
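The two metrics quoted above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the labels and scores are invented toy data (1 = binding residue, 0 = nonbinding), the 0.5 decision threshold is an assumption, and the function names are hypothetical. It only shows how AUC and the Matthews correlation coefficient are computed on an unbalanced set of residues.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, unbalanced example: 2 binding residues among 10 (invented values).
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.35, 0.3, 0.05, 0.6, 0.15, 0.25]

threshold = 0.5  # assumed cutoff, not taken from the paper
predictions = [1 if s >= threshold else 0 for s in scores]

mcc = matthews_corrcoef(labels, predictions)
auc = roc_auc(labels, scores)
print(f"MCC = {mcc:.3f}, AUC = {auc:.3f}")
```

AUC is threshold-free and depends only on how the scores rank residues, whereas the Matthews correlation coefficient depends on the chosen cutoff and uses all four confusion-matrix counts, which makes it informative when nonbinding residues greatly outnumber binding ones.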
