Prediction of RNA binding sites in proteins from amino acid sequence.

RNA-protein interactions are vitally important in a wide range of biological processes, including regulation of gene expression, protein synthesis, and the replication and assembly of many viruses. We have developed a computational tool, RNABindR, for predicting which amino acids of an RNA binding protein participate in RNA-protein interactions, using only the protein sequence as input. RNABindR was developed using machine learning on a validated nonredundant data set of interfaces from known RNA-protein complexes in the Protein Data Bank. It generates a classifier that captures primary sequence signals sufficient for predicting which amino acids in a given protein are located in the RNA-protein interface. In leave-one-out cross-validation experiments, RNABindR identifies interface residues with >85% overall accuracy. It can be calibrated by the user to obtain either high specificity or high sensitivity for interface residues. RNABindR, which implements a Naive Bayes classifier, performs as well as a more complex neural network classifier (to our knowledge, the only previously published sequence-based method for RNA binding site prediction) and offers the advantages of speed, simplicity, and interpretability of results. RNABindR predictions on the human telomerase protein hTERT are in good agreement with experimental data. The availability of computational tools for predicting which residues in an RNA binding protein are likely to contact RNA should facilitate the design of experiments that directly test RNA binding function, and should contribute to our understanding of the diversity, mechanisms, and regulation of RNA-protein complexes in biological systems. RNABindR is available as a Web tool from http://bindr.gdcb.iastate.edu.
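
The general idea of a sequence-only Naive Bayes predictor with a user-adjustable decision threshold can be illustrated with a minimal sketch. This is not the RNABindR implementation: the sliding-window encoding, the window half-width, the Laplace pseudocount, the class name `WindowNaiveBayes`, and the log-odds threshold rule are all assumptions made for illustration. Lowering the threshold labels more residues as interface (favoring sensitivity), while raising it labels fewer (favoring specificity).

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # placeholder for window positions beyond the sequence ends


def windows(seq, half_width):
    """Yield one window of 2*half_width + 1 residues centered on each position."""
    padded = PAD * half_width + seq + PAD * half_width
    for i in range(len(seq)):
        yield padded[i:i + 2 * half_width + 1]


class WindowNaiveBayes:
    """Hypothetical per-residue Naive Bayes classifier over sequence windows."""

    def __init__(self, half_width=2, pseudocount=1.0):
        self.half_width = half_width
        self.pseudocount = pseudocount
        # counts[label][window position][residue] -> number of occurrences
        self.counts = {y: defaultdict(lambda: defaultdict(float)) for y in (0, 1)}
        self.class_totals = {0: 0.0, 1: 0.0}

    def fit(self, sequences, labels):
        """labels: one string of '0'/'1' per sequence ('1' = interface residue)."""
        for seq, lab in zip(sequences, labels):
            for win, y in zip(windows(seq, self.half_width), lab):
                y = int(y)
                self.class_totals[y] += 1
                for pos, aa in enumerate(win):
                    self.counts[y][pos][aa] += 1
        return self

    def _log_likelihood(self, win, y):
        # Laplace-smoothed estimate of P(residue at window position | class),
        # multiplied across positions under the Naive Bayes independence assumption.
        alphabet_size = len(AMINO_ACIDS) + 1  # 20 amino acids plus the pad symbol
        total = self.class_totals[y] + self.pseudocount * alphabet_size
        return sum(
            math.log((self.counts[y][pos][aa] + self.pseudocount) / total)
            for pos, aa in enumerate(win)
        )

    def scores(self, seq):
        """Log-odds of interface versus non-interface for every residue in seq."""
        n = self.class_totals[0] + self.class_totals[1]
        log_prior_odds = (math.log((self.class_totals[1] + 1) / (n + 2))
                          - math.log((self.class_totals[0] + 1) / (n + 2)))
        return [
            log_prior_odds + self._log_likelihood(w, 1) - self._log_likelihood(w, 0)
            for w in windows(seq, self.half_width)
        ]

    def predict(self, seq, threshold=0.0):
        """Label a residue '1' when its log-odds score exceeds the threshold."""
        return "".join("1" if s > threshold else "0" for s in self.scores(seq))
```

In use, one would call `fit` on sequences paired with per-residue interface labels derived from known complexes, then call `predict` on a new sequence with a `threshold` chosen for the desired balance between sensitivity and specificity.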
