Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning

RNA-binding proteins (RBPs) play important roles in cellular processes, and identifying them both computationally and experimentally is essential. Our group recently proposed an RBP predictor, RBPPred. However, RBPPred is slow because it must generate a position-specific scoring matrix (PSSM) as one of its features. Here, based on the protein features of RBPPred and a convolutional neural network (CNN), we develop a deep learning model called Deep-RBPPred. Training on balanced and imbalanced training sets yields the Deep-RBPPred-balance and Deep-RBPPred-imbalance models, respectively. Deep-RBPPred has three advantages over previous methods: (1) it needs only a few physicochemical properties derived from the protein sequence; (2) it runs much faster; and (3) it generalizes well. At the same time, Deep-RBPPred performs as well as the state-of-the-art method. When tested on the A. thaliana, S. cerevisiae and H. sapiens proteomes with a score cutoff of 0.5, the MCC values are 0.82 (0.82), 0.65 (0.69) and 0.85 (0.80), respectively, for the balance model (imbalance model). On the same testing dataset, we also compare different machine learning algorithms (CNN and SVM). The results show that the CNN-based model identifies more RBPs than the SVM-based model. Comparing the balance and imbalance models, both the CNN-based and SVM-based models tend to favor the majority class when trained on the imbalanced set. Deep-RBPPred predicts 280 (balance model) and 265 (imbalance model) of 299 new RBPs. The sensitivity of the balance model is about 7% higher than that of the state-of-the-art method. We also apply Deep-RBPPred to 30 eukaryotic and 109 bacterial proteomes downloaded from UniProt to estimate all possible RBPs. The estimates show that the proportion of RBPs is much higher in eukaryotic proteomes than in bacterial proteomes.
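
The evaluation figures quoted above (MCC at a score cutoff of 0.5, and sensitivity on the 299 new RBPs) follow from the standard confusion-matrix definitions. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and treating a score greater than or equal to the cutoff as a positive (RBP) call is an assumed convention.

```python
import math


def confusion_counts(scores, labels, cutoff=0.5):
    """Count TP, FP, TN, FN for scores thresholded at `cutoff`.

    Assumption: a score >= cutoff is called positive (RBP);
    labels are 1 for RBP and 0 for non-RBP.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true RBPs that are recovered: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Counts taken from the abstract: 280 and 265 of 299 new RBPs recovered.
    print(f"balance model sensitivity:   {sensitivity(280, 299 - 280):.3f}")
    print(f"imbalance model sensitivity: {sensitivity(265, 299 - 265):.3f}")
```

With the counts reported in the abstract, the sensitivity on the new RBPs comes out at roughly 0.94 for the balance model and 0.89 for the imbalance model. MCC additionally requires the false-positive and true-negative counts, which the abstract does not give, so no MCC value is reproduced here.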
