XGBPRH: Prediction of Binding Hot Spots at Protein–RNA Interfaces Utilizing Extreme Gradient Boosting

Hot spot residues at protein–RNA interfaces are essential for understanding the underlying molecular recognition mechanism. Accurately identifying protein–RNA binding hot spots is critical for drug design and protein engineering. Although some progress has been made using a variety of features and machine learning approaches, these methods are still in their infancy. In this paper, we present a new computational method named XGBPRH, which is based on the eXtreme Gradient Boosting (XGBoost) algorithm and predicts hot spot residues at protein–RNA interfaces from an optimal set of properties. First, we obtain 47 protein–RNA complexes and calculate a total of 156 sequence, structure, exposure, and network features. Next, we apply a two-step feature selection algorithm to select an optimal subset of 6 features from these 156. Compared with state-of-the-art approaches, XGBPRH achieves better performance, with an area under the ROC curve (AUC) of 0.817 and an F1-score of 0.802 on the independent test set. We also apply XGBPRH to two case studies. The results demonstrate that the method can effectively identify novel energy hot spots.
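For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1-score and an AUC can be computed for a binary hot spot / non-hot spot labeling. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions; they are not the XGBPRH data or the authors' evaluation code.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = hot spot, 0 = non-hot spot.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # assumed classifier outputs
preds = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

toy_f1 = f1_score(labels, preds)
toy_auc = roc_auc(labels, scores)
```

The F1-score depends on the chosen decision threshold, whereas the AUC depends only on how the scores rank the residues, which is why the two are usually reported together.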
