A novel sequence-based prediction method for ATP-binding sites using fusion of SMOTE algorithm and random forests classifier

Abstract Correctly identifying protein-ATP binding sites is valuable for both protein function annotation and drug discovery. However, non-ATP-binding residues far outnumber ATP-binding residues, which makes the prediction a classic imbalanced learning problem. Previous studies often apply under-sampling to construct a relatively balanced dataset, but some information is inevitably lost during sampling. In this work, we use the SMOTE algorithm, which balances the dataset by synthesizing additional ATP-binding samples through interpolation between existing ones. A Random Forest is selected as the classifier to keep training speed acceptable. Combining it with a complementary template-based method further improves prediction performance. Compared with other sequence-based predictors, the proposed method achieves satisfactory performance and proves to be efficient for ATP-binding site prediction.
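
To make the interpolation idea concrete, the following is a minimal sketch of SMOTE-style over-sampling in plain Python. It is not the authors' implementation: the function name `smote`, the two-dimensional toy feature vectors, and the choices `k=2` and `seed=0` are assumptions made for illustration, and the Random Forest training step and the template-based component are omitted.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )[:k]
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between x and the chosen neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical toy data: 3 ATP-binding vs 9 non-binding residue feature vectors.
binding = [[0.9, 0.8], [1.0, 1.0], [0.8, 1.1]]
non_binding = [[0.1 * i, 0.05 * i] for i in range(9)]

new_points = smote(binding, len(non_binding) - len(binding), k=2)
balanced_binding = binding + new_points
print(len(balanced_binding), len(non_binding))
```

Because every synthetic sample lies between two real ATP-binding samples, no non-binding residues have to be discarded, which is the contrast with under-sampling drawn in the abstract. In practice the balanced set would then be passed to the classifier.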
