A novel method for predicting RNA-interacting residues in proteins using a combination of feature-based and sequence template-based methods

Abstract RNA-binding proteins (RBPs) play a significant role in many cellular processes and in the regulation of gene expression. Accurately identifying the RNA-interacting residues in protein sequences is therefore crucial for characterizing the structure of RBPs and inferring their function, which in turn supports new drug design. Protein sequence, as the most basic information available, has been widely used in protein research in combination with machine learning techniques. Here, we propose a sequence-based method to predict the RNA-interacting residues in protein sequences. The method combines two predictors: a feature-based predictor and a sequence template-based predictor. The feature-based predictor applies a random forest (RF) classifier to protein sequence information. After the classification probabilities are obtained, an adjustment procedure is applied that accounts for the correlation between neighbouring RNA-interacting residues. The sequence template-based predictor selects the optimal template for the query sequence by multiple sequence alignment and maps the interacting residues of the template sequence onto the query sequence. Combining the two predictors greatly improves the coverage and prediction performance of our method: on our validation set, the MCC value increases from 0.467 and 0.352 for the two individual predictors to 0.499 for the combined method. To evaluate the proposed method, we compare it with two other hybrid methods on an independent testing set. Our method outperforms both, with an overall accuracy of 0.817, an MCC value of 0.511 and an F-score of 0.605, which demonstrates that it can reliably predict the RNA-interacting residues in protein sequences. Moreover, the effectiveness of our newly proposed adjustment procedure in the feature-based predictor is examined and analyzed in detail.
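As a rough illustration of two ideas mentioned in the abstract, the Python sketch below shows (a) one possible way to adjust per-residue probabilities using sequence neighbours and (b) how accuracy, MCC and F-score follow from confusion-matrix counts. The abstract does not specify the adjustment procedure, so `smooth_probabilities`, its `window` and `weight` parameters, and the averaging rule are hypothetical stand-ins, not the authors' method. The metric formulas are the standard definitions.

```python
from math import sqrt


def smooth_probabilities(probs, window=1, weight=0.5):
    """Blend each residue's probability with the mean of its neighbours.

    Hypothetical adjustment: the paper's actual procedure is not given
    in the abstract. `window` is the number of residues considered on
    each side; `weight` is the share given to the neighbour mean.
    """
    n = len(probs)
    smoothed = []
    for i, p in enumerate(probs):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [probs[j] for j in range(lo, hi) if j != i]
        if neighbours:
            mean_nb = sum(neighbours) / len(neighbours)
            smoothed.append((1 - weight) * p + weight * mean_nb)
        else:
            # A sequence of length 1 has no neighbours; keep the value.
            smoothed.append(p)
    return smoothed


def binary_metrics(y_true, y_pred):
    """Return (accuracy, MCC, F-score) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f_denom = 2 * tp + fp + fn
    f_score = 2 * tp / f_denom if f_denom else 0.0
    return accuracy, mcc, f_score


if __name__ == "__main__":
    probs = [0.2, 0.9, 0.4, 0.1]           # made-up classifier outputs
    adjusted = smooth_probabilities(probs)
    predicted = [1 if p >= 0.5 else 0 for p in adjusted]
    print(adjusted, predicted)
    print(binary_metrics([0, 1, 1, 0], predicted))
```

MCC is the more informative figure for this kind of task because interacting residues are typically a minority class, so accuracy alone can look high even for a weak predictor.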
