Computational identification of binding energy hot spots in protein–RNA complexes using an ensemble approach

Motivation: Identifying RNA-binding residues, especially energetically favored hot spots, can provide valuable clues to the mechanisms and functional importance of protein-RNA interactions. However, few energy hot spots in protein-RNA crystal structures have been experimentally characterized, which makes it difficult to develop empirical identification approaches. Computational prediction of RNA-binding hot spot residues is still in its infancy.

Results: Here, we describe a computational method, PrabHot (Prediction of protein-RNA binding hot spots), that detects hot spot residues on protein-RNA binding interfaces using an ensemble of conceptually different machine learning classifiers. Residue interaction network features and new solvent exposure characteristics are combined, and the Boruta algorithm selects the features used for classification. Two new reference datasets (benchmark and independent) have been generated, containing 107 hot spots from 47 known protein-RNA complex structures. In 10-fold cross-validation on the training dataset, PrabHot achieves an AUC of 0.86 and a sensitivity of 0.78, which are significantly better than those of the pioneering RNA-binding hot spot prediction method HotSPRing. We also evaluate the proposed method on the independent test dataset, where it retains a competitive advantage.

Availability and implementation: The PrabHot webserver is freely available at http://denglab.org/PrabHot/.

Contact: leideng@csu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
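As an illustration of the evaluation quantities quoted above, the following minimal sketch combines the outputs of several classifiers and scores the result by sensitivity and AUC. It is not PrabHot's implementation: the abstract does not state how the ensemble members are combined, so simple probability averaging (soft voting) is an assumption here, and the function names, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from typing import List, Sequence


def soft_vote(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue hot-spot probabilities across classifiers.

    Assumption: plain averaging; the real ensemble rule may differ.
    """
    n_classifiers = len(prob_lists)
    return [sum(column) / n_classifiers for column in zip(*prob_lists)]


def sensitivity(y_true: Sequence[int], y_score: Sequence[float],
                threshold: float = 0.5) -> float:
    """Fraction of true hot spots (label 1) scored at or above the threshold."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    if not positives:
        return 0.0
    return sum(1 for s in positives if s >= threshold) / len(positives)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a random hot spot outscores a random
    non-hot-spot, counting ties as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = hot spot, 0 = non-hot-spot interface residue.
labels = [1, 1, 0, 0, 1, 0]
clf_a = [0.9, 0.6, 0.4, 0.2, 0.3, 0.1]
clf_b = [0.8, 0.4, 0.5, 0.1, 0.2, 0.1]
clf_c = [0.7, 0.8, 0.3, 0.3, 0.4, 0.1]

ensemble_scores = soft_vote([clf_a, clf_b, clf_c])
ensemble_sensitivity = sensitivity(labels, ensemble_scores)
ensemble_auc = auc(labels, ensemble_scores)
```

Sensitivity depends on the chosen decision threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two are usually reported together.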
