RNABindRPlus: A Predictor that Combines Machine Learning and Sequence Homology-Based Methods to Improve the Reliability of Predicted RNA-Binding Residues in Proteins

Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression, and they play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; and (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins, and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On the subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods, achieving a Matthews correlation coefficient (MCC) of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues for all proteins in both test sets, achieving an MCC of 0.55 on RB44 and 0.37 on RB111 and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall.
An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/.
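
The MCC, precision, and recall values quoted above are all derived from per-residue confusion counts. As a minimal sketch of how such scores are typically computed, the Python below uses the standard textbook definitions; the function names and the toy labels are illustrative assumptions and are not taken from the RNABindRPlus implementation or its test sets.

```python
import math

def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary per-residue labels (1 = RNA-binding)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif p:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall; each falls back to 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy labels for an 8-residue protein (not real data).
actual = [1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
score = mcc(tp, fp, tn, fn)
precision, recall = precision_recall(tp, fp, fn)
```

MCC is often preferred for this kind of task because binding residues are usually a minority class, and MCC uses all four confusion counts rather than only the positives.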
