SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting

MOTIVATION Mitochondria are essential organelles in most eukaryotes. They not only play an important role in energy metabolism but also take part in many critical cytopathological processes. Abnormal mitochondria can trigger a series of human diseases, such as Parkinson's disease, multifactor disorder, and type II diabetes. Knowing the submitochondrial location of a protein helps clarify its function, which in turn supports the study of disease pathogenesis and drug design.

RESULTS We propose a new method, SubMito-XGBoost, for predicting protein submitochondrial localization. The method has three steps: (i) the g-gap dipeptide composition (g-gap DC), pseudo-amino acid composition (PseAAC), auto-correlation function (ACF), and Bi-gram position-specific scoring matrix (Bi-gram PSSM) are used to extract protein sequence features; (ii) the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the samples, and the ReliefF algorithm is applied for feature selection; and (iii) the resulting feature vectors are fed into XGBoost to predict protein submitochondrial locations. Under leave-one-out cross-validation (LOOCV), SubMito-XGBoost outperforms existing methods. Its prediction accuracies on the two training datasets M317 and M983 were 97.7% and 98.9%, which are 2.8-12.5% and 3.8-9.9% higher than those of other methods, respectively. Its prediction accuracy on the independent test set M495 was 94.8%, which is significantly better than that of existing studies. The proposed method also achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets. SubMito-XGBoost may therefore also be useful in designing new drugs for the treatment of related diseases.

AVAILABILITY AND IMPLEMENTATION The source code and data are publicly available at https://github.com/QUST-AIBBDRC/SubMito-XGBoost/.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
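As an illustration of the first descriptor named in step (i), the following is a minimal sketch of a g-gap dipeptide composition, assuming the common definition in which residue pairs separated by g intervening positions are counted and normalized into a 400-dimensional frequency vector. The function name `g_gap_dipeptide_composition`, the decision to skip pairs containing non-standard residues, and the normalization by the number of valid pairs are assumptions made here for illustration; they are not taken from the SubMito-XGBoost source code.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered residue pairs, in a fixed order, and their vector positions.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def g_gap_dipeptide_composition(sequence, g=0):
    """Return the 400-dimensional g-gap dipeptide frequency vector.

    A pair is formed from residue i and residue i + g + 1, so g = 0 gives
    the ordinary (adjacent) dipeptide composition.
    Assumption: pairs containing a non-standard residue (e.g. 'X') are
    skipped, and frequencies are normalized by the number of valid pairs.
    """
    if g < 0:
        raise ValueError("g must be non-negative")
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - g - 1):
        idx = INDEX.get(seq[i] + seq[i + g + 1])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap, or no valid pairs at all.
        return [0.0] * len(DIPEPTIDES)
    return [count / total for count in counts]


if __name__ == "__main__":
    features = g_gap_dipeptide_composition("ACAC", g=1)
    print(len(features), features[INDEX["AA"]], features[INDEX["CC"]])
```

In a pipeline like the one described above, vectors of this kind would be concatenated with the other descriptors before balancing, feature selection, and classification; the value of g actually used by the authors is not stated in the abstract.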
