Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information

RNA-binding proteins play an important role in many biological processes. However, identifying RNA-binding residues experimentally is time-consuming and expensive, so an effective computational approach offers a practical alternative. In recent years, most computational approaches have been built on protein sequence information alone, without considering protein structure. In this paper, we propose a novel computational model for RNA-binding residue prediction that uses both protein sequence and structure information. Our hybrid features are encoded by local sequence and structure feature extraction models. Our predictor is built on the Granular Multiple Kernel Support Vector Machine with Repetitive Under-sampling (GMKSVM-RU). To evaluate our method, we perform fivefold cross-validation on the RBP129 data set, where it achieves an MCC of 0.3367 and an accuracy of 88.84%. To further evaluate the model, we use an independent data set (RBP60), on which our method achieves an MCC of 0.3921 and an accuracy of 87.52%. These results demonstrate that integrating sequence and structure information improves the prediction of RNA-binding residues.
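The abstract reports both accuracy and the Matthews correlation coefficient (MCC). Because binding residues are typically far fewer than non-binding ones, accuracy alone can look high even for an uninformative predictor, which is one reason MCC is commonly reported alongside it. The following is a minimal sketch of the two standard metric definitions, not the authors' evaluation code; the function names and the residue counts used in the usage note are hypothetical and chosen only for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of residues classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical example: 100 residues, 10 of them binding (made-up counts).
labels = [1] * 10 + [0] * 90
all_negative = [0] * 100  # a predictor that never predicts "binding"

counts = confusion_counts(labels, all_negative)
trivial_accuracy = accuracy(*counts)
trivial_mcc = mcc(*counts)
```

In this made-up case the trivial predictor is right on every non-binding residue and wrong on every binding one, so its accuracy is high while its MCC is zero. That gap is why the two numbers are best read together on imbalanced data sets.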
