Prediction of RNA‐binding residues in proteins from primary sequence using an enriched random forest model with a novel hybrid feature

The identification of RNA‐binding residues in proteins is important in several areas, such as protein function, posttranscriptional regulation, and drug design. We have developed PRBR (Prediction of RNA Binding Residues), a novel method for identifying RNA‐binding residues from amino acid sequences. Our method combines a hybrid feature with the enriched random forest (ERF) algorithm. The hybrid feature is composed of predicted secondary structure information and three novel features: evolutionary information combined with conservation information of the physicochemical properties of amino acids, and information about the dependency of amino acids with regard to polarity‐charge and hydrophobicity in the protein sequences. Our results demonstrate that the PRBR model achieves a Matthews correlation coefficient (MCC) of 0.5637 and an overall accuracy (ACC) of 88.63%, with 53.70% sensitivity (SE) and 96.97% specificity (SP). By comparing the performance of each feature, we found that all three novel features contribute to the improved predictions. The area under the curve (AUC) from receiver operating characteristic curve analysis was compared between the PRBR model and other models. The results show that PRBR achieves the highest AUC value (0.8675), indicating that PRBR performs very well at predicting RNA‐binding residues in proteins. The PRBR web‐server implementation is freely available at http://www.cbi.seu.edu.cn/PRBR/. Proteins 2011; © 2011 Wiley‐Liss, Inc.
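The four per‐residue measures reported above (MCC, ACC, SE, SP) all follow from a binary confusion matrix over binding and non‐binding residues. The Python sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical illustrations and are not taken from the PRBR implementation or its data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, SE, SP and MCC from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    Standard definitions; not code from the PRBR authors.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    se = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on binders)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on non-binders)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in binary_metrics(tp=40, fp=5, tn=45, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Because binding residues are typically much rarer than non‐binding ones, ACC can be high even when SE is modest, which is why MCC is usually reported alongside it.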
