Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art

Background
RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition ‘code’ that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction.

Results
We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows.
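As a rough illustration of what a sequence-window encoding can look like, the sketch below builds an amino acid identity window around each residue, in the spirit of an identity-based representation. The window size, the padding symbol, the handling of unknown residues, and the function name `identity_windows` are all assumptions made for illustration; the abstract does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed padding symbol for positions beyond the sequence termini


def identity_windows(sequence, window_size=5):
    """Return one concatenated one-hot vector per residue.

    Each residue is represented by the identities of the residues in a
    window centred on it. Unknown symbols are mapped to the padding slot
    (an assumption, not taken from the source).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a centre")
    half = window_size // 2
    alphabet = AMINO_ACIDS + PAD
    index = {aa: i for i, aa in enumerate(alphabet)}
    padded = PAD * half + sequence + PAD * half

    features = []
    for center in range(len(sequence)):
        window = padded[center:center + window_size]
        vector = []
        for aa in window:
            one_hot = [0] * len(alphabet)
            one_hot[index.get(aa, index[PAD])] = 1
            vector.extend(one_hot)
        features.append(vector)
    return features


if __name__ == "__main__":
    feats = identity_windows("MKRA", window_size=3)
    # One vector per residue; each vector has window_size * 21 entries.
    print(len(feats), len(feats[0]))
```

A PSSM-based window would follow the same pattern, with each one-hot block replaced by that position's row of profile scores.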
Our results show that (i) sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity-based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); and (ii) structure-based classifiers that use a smoothed PSSM representation (SmoPSSMStr) outperform those that use a PSSM (PSSMStr) or a sequence identity-based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues.

Conclusions
Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity.
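The performance measures named above can be computed from confusion counts, and the gap between residue-level and protein-level performance can be seen by pooling residues versus averaging per protein. The following is a minimal sketch under stated assumptions: the abstract does not define Specificity, and the term is sometimes used for TP/(TP+FP) in the binding-site prediction literature rather than the more common TN/(TN+FP), so both quantities are reported under unambiguous names. The function names and the toy labels are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = interface residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return 0.0 when the denominator is zero (a convention, assumed here)."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),         # TP / (TP + FN)
        "precision": safe_div(tp, tp + fp),           # TP / (TP + FP)
        "true_negative_rate": safe_div(tn, tn + fp),  # TN / (TN + FP)
        "mcc": safe_div(tp * tn - fp * fn, denom),
    }


def residue_level(proteins):
    """Pool all residues from all proteins, then compute the measures once."""
    y_true = [t for labels, _ in proteins for t in labels]
    y_pred = [p for _, preds in proteins for p in preds]
    return metrics(y_true, y_pred)


def protein_level(proteins):
    """Compute the measures per protein, then average over proteins."""
    per_protein = [metrics(labels, preds) for labels, preds in proteins]
    return {k: sum(m[k] for m in per_protein) / len(per_protein)
            for k in per_protein[0]}


if __name__ == "__main__":
    # Toy data: (true labels, predicted labels) for two proteins.
    toy = [
        ([1, 1, 0, 0], [1, 1, 0, 0]),
        ([1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0]),
    ]
    print("residue level:", residue_level(toy))
    print("protein level:", protein_level(toy))
```

On this toy input the two views give different MCC values, because pooling weights each protein by its number of residues while per-protein averaging weights every protein equally.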
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets constructed using several alternative definitions of interface residues and redundancy cutoffs, and of including evaluations on independent test sets in the comparisons.
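To make the dependence on the interface definition concrete, the sketch below labels a residue as an interface residue when any of its atoms lies within a distance cutoff of any RNA atom. This is one common style of definition, assumed here for illustration; the cutoff values, the input layout, and the function name `interface_residues` are hypothetical and are not taken from the abstract.

```python
import math


def interface_residues(protein_atoms, rna_atoms, cutoff):
    """Return the set of residue ids with an atom within `cutoff` of the RNA.

    protein_atoms: list of (residue_id, (x, y, z)) tuples
    rna_atoms:     list of (x, y, z) tuples
    cutoff:        distance threshold in the same units as the coordinates
    """
    interface = set()
    for residue_id, coord in protein_atoms:
        if residue_id in interface:
            continue  # already labelled via another atom of this residue
        if any(math.dist(coord, rna) <= cutoff for rna in rna_atoms):
            interface.add(residue_id)
    return interface


if __name__ == "__main__":
    # Toy coordinates, not real structure data.
    protein = [
        ("A1", (0.0, 0.0, 0.0)),
        ("A1", (1.0, 0.0, 0.0)),
        ("A2", (3.5, 0.0, 0.0)),
        ("A3", (10.0, 0.0, 0.0)),
    ]
    rna = [(0.0, 0.0, 3.0)]
    for cutoff in (3.5, 5.0):  # illustrative thresholds
        print(cutoff, sorted(interface_residues(protein, rna, cutoff)))
```

Raising the cutoff can only add residues to the positive class, so the class balance and hence the reported measures shift with the threshold.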
