Imbalance Data Processing Strategy for Protein Interaction Sites Prediction

Protein-protein interactions play essential roles in various biological processes. Identifying protein interaction sites helps researchers understand life activities and can therefore support drug design. However, the number of experimentally determined protein interaction sites is far smaller than the number of protein sites in protein-protein interactions or protein complexes. As a result, the negative and positive samples are usually imbalanced, a common problem that biases the prediction of protein interaction sites by computational approaches. In this work, we present three imbalanced data processing strategies to reconstruct the original dataset, and then extract protein features from the evolutionary conservation of amino acids to build a predictor for identifying protein interaction sites. On a dataset with 10,430 surface residues but only 2,299 interface residues, the imbalanced data processing strategies clearly reduce the prediction bias and thereby improve the prediction performance for protein interaction sites. The experimental results show that our prediction models achieve better prediction performance, with a prediction accuracy of 0.758 and an F-measure of 0.737, which demonstrates the effectiveness of our method.
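The abstract does not name the three strategies, so the following is only a minimal sketch of one common way to rebalance a dataset: random undersampling of the majority class. It is an assumed stand-in, not necessarily one of the strategies used in this work. The function names `undersample` and `f_measure`, the toy labels, and the seed are all hypothetical and chosen for illustration. The sketch also shows the standard F-measure, the harmonic mean of precision and recall, which is the kind of metric reported above.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size.

    Assumption: labels are 1 (interface residue) or 0 (non-interface residue).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Identify which class is the minority and which is the majority.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority sample and an equally sized random subset of the majority.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 10 residues, only 2 of them labeled as interface.
toy_samples = [f"residue_{i}" for i in range(10)]
toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

balanced_samples, balanced_labels = undersample(toy_samples, toy_labels)
print(balanced_labels.count(1), balanced_labels.count(0))
print(f_measure(tp=8, fp=2, fn=2))
```

Undersampling discards information from the majority class, which is one reason other rebalancing schemes (for example, oversampling the minority class or filtering noisy majority samples) are often compared against it.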
