Hot spot prediction in protein-protein interactions by an ensemble system

Background: Hot spot residues are functional sites in protein interaction interfaces. Identifying them experimentally is time-consuming and laborious, so many computational methods have been developed to predict hot spot residues. Most of these methods rely on structural features, sequence characteristics, and/or other protein features.

Results: This paper proposes an ensemble learning method that predicts hot spot residues using only sequence features and the relative accessible surface area of amino acid sequences. A novel feature selection technique was developed, an auto-correlation function combined with a sliding window technique was applied to obtain the characteristics of amino acid residues in the protein sequence, and an ensemble classifier with SVM and KNN base classifiers was built to achieve the best classification performance.

Conclusion: In the experiments, the model achieved its highest F1 score of 0.92, with an MCC value of 0.87, on the ASEdb dataset. Compared with other machine learning methods, the model achieves a substantial improvement in hot spot prediction.

Availability: http://deeplearner.ahu.edu.cn/web/HotspotEL.htm
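The abstract does not give the exact form of the auto-correlation function or the window size, so the following is only a minimal sketch of the general idea: each residue is described by the auto-correlation, at several lags, of a physicochemical property over a window centred on that residue. The Kyte-Doolittle hydropathy scale, the window half-width, the maximum lag, and the names `window`, `autocorrelation`, and `residue_features` are all illustrative assumptions rather than the authors' choices.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
# Assumption: the paper may use different (e.g. AAindex) properties.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window(values, center, half_width):
    """Return the values within half_width of center, truncated at the ends."""
    start = max(0, center - half_width)
    end = min(len(values), center + half_width + 1)
    return values[start:end]


def autocorrelation(values, lag):
    """Average product of mean-centred values that are `lag` positions apart."""
    n = len(values)
    if lag >= n:
        return 0.0  # no pairs at this lag in a window this short
    mean = sum(values) / n
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def residue_features(sequence, half_width=3, max_lag=3):
    """One feature vector per residue: auto-correlation at lags 1..max_lag."""
    values = [KD[aa] for aa in sequence]
    features = []
    for i in range(len(values)):
        w = window(values, i, half_width)
        features.append([autocorrelation(w, lag) for lag in range(1, max_lag + 1)])
    return features


if __name__ == "__main__":
    for aa, vec in zip("MKTAYIAKQR", residue_features("MKTAYIAKQR")):
        print(aa, [round(v, 3) for v in vec])
```

Vectors of this kind, possibly concatenated with relative accessible surface area, could then serve as input to the SVM and KNN base classifiers; how the paper combines the two classifiers is not specified in the abstract and is not shown here.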
