Selecting high-quality negative samples for effectively predicting protein-RNA interactions

Background: The identification of Protein-RNA Interactions (PRIs) is important for understanding cell activities. Several machine learning-based methods have recently been developed for identifying PRIs, but their performance is unsatisfactory. One major reason is that they usually use unreliable negative samples in the training process.

Methods: To boost the performance of PRI prediction, we propose a novel method to generate reliable negative samples. First, we collect the known PRIs as positive samples and use them to generate positive sets. For each positive set, we construct two corresponding negative sets, one by our method and the other by random selection. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. This yields 18 datasets covering different species and different ratios of negative to positive samples. Second, sequence-based features are extracted to represent each PRI and each protein-RNA pair in the datasets. A filter-based method is employed to reduce the dimensionality of the feature vectors and thereby the computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets.

Results: Extensive experiments show that, compared with using randomly generated negative samples, all classifiers achieve substantial performance improvements when using negative samples selected by our method. The improvements in accuracy and geometric mean are as high as 204.5% and 68.7% for the SVM classifier, 174.5% and 53.9% for the RF classifier, and 80.9% and 54.3% for the NB classifier, respectively.

Conclusion: Our method is useful for the identification of PRIs.
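The random baseline and the two reported metrics can be illustrated with a minimal sketch. It does not implement the proposed negative-selection method, which the abstract does not specify. The function names `random_negative_pairs` and `accuracy_and_gmean` are hypothetical, and the geometric mean is assumed to be the usual square root of sensitivity times specificity used for imbalanced data; the abstract does not give its definition.

```python
import math
import random


def random_negative_pairs(positive_pairs, ratio=1, seed=0):
    """Random baseline: sample protein-RNA pairs that are not known to interact.

    positive_pairs: iterable of (protein_id, rna_id) tuples (known PRIs).
    ratio: number of negative samples per positive sample (assumed integer).
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    proteins = sorted({p for p, _ in positives})
    rnas = sorted({r for _, r in positives})
    # Every protein-RNA combination that is absent from the positive set.
    candidates = [(p, r) for p in proteins for r in rnas
                  if (p, r) not in positives]
    wanted = ratio * len(positives)
    if wanted > len(candidates):
        raise ValueError("not enough non-interacting candidate pairs")
    return rng.sample(candidates, wanted)


def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, geometric mean) for binary labels 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Assumed definition: sqrt(sensitivity * specificity).
    return accuracy, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    known = [("P1", "R1"), ("P2", "R2"), ("P3", "R3")]  # toy data, not from the paper
    negatives = random_negative_pairs(known, ratio=2)
    print(len(negatives), "negative pairs sampled")
    print(accuracy_and_gmean([1, 1, 0, 0], [1, 0, 0, 0]))
```

Randomly sampled pairs of this kind may include true but undiscovered interactions, which is the unreliability the abstract attributes to random negative sets. Reporting the geometric mean alongside accuracy keeps a classifier from scoring well on imbalanced datasets by predicting only the majority class.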
