Computationally predicting protein-RNA interactions using only positive and unlabeled examples

Protein-RNA interactions (PRIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to the host's active defense against viruses. With the development of high-throughput technologies, large amounts of PRI data have become available for computationally predicting unknown PRIs. In recent years, a number of computational methods for predicting PRIs have been developed; these usually train classifiers on negative samples constructed artificially from verified, nonredundant PRI datasets. However, such samples are not true negatives, and some may even be unknown positives. Consequently, classifiers trained on such datasets cannot achieve satisfactory prediction performance. In this paper, we propose a novel method, PRIPU, that employs a biased support vector machine (biased-SVM) for predicting Protein-RNA Interactions using only Positive and Unlabeled examples. To the best of our knowledge, this is the first work that predicts PRIs using only positive and unlabeled samples. We first collect known PRIs as our benchmark datasets and extract sequence-based features to represent each PRI. To reduce the dimension of the feature vectors and thus lower the computational cost, we select a subset of features with a filter-based feature selection method. Then, biased-SVM is employed to train prediction models on different PRI datasets. To evaluate the new method, we also propose a new performance measure called explicit positive recall (EPR), which is specifically suited to the task of learning from positive and unlabeled data. Experimental results on three datasets show that our method not only outperforms four existing methods but is also able to predict unknown PRIs. Source code, datasets and related documents of PRIPU are available at http://admis.fudan.edu.cn/projects/pripu.htm.
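
To illustrate the general idea behind a biased-SVM for positive and unlabeled data, the following is a minimal sketch, not the PRIPU implementation. It assumes the standard biased-SVM formulation, in which unlabeled examples are provisionally treated as negatives and the hinge loss carries a larger cost for known positives than for unlabeled examples. The linear kernel, the subgradient-descent solver, the toy two-dimensional data, the cost values, and all function and variable names (`train_biased_svm`, `c_pos`, `c_unl`, and so on) are assumptions made for this example; the features, kernel, and solver actually used by PRIPU are not specified in the abstract.

```python
# Minimal biased-SVM sketch for positive-unlabeled (PU) learning.
# Assumptions (not from the paper): linear kernel, full-batch subgradient
# descent, toy 2-D data, and the hypothetical names used below.

def score(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def objective(w, b, pos, unl, c_pos, c_unl):
    """0.5*||w||^2 plus class-weighted hinge losses.

    Known positives get label +1 and cost c_pos; unlabeled examples are
    treated as label -1 with the (usually smaller) cost c_unl.
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    loss_pos = sum(max(0.0, 1.0 - score(w, b, x)) for x in pos)
    loss_unl = sum(max(0.0, 1.0 + score(w, b, x)) for x in unl)
    return reg + c_pos * loss_pos + c_unl * loss_unl


def train_biased_svm(pos, unl, c_pos=10.0, c_unl=1.0, lr=0.01, epochs=200):
    """Fit (w, b) by full-batch subgradient descent on the objective."""
    dim = len(pos[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0, c_pos) for x in pos] + [(x, -1.0, c_unl) for x in unl]
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer is w
        for x, y, c in data:
            if y * score(w, b, x) < 1.0:  # margin violated: hinge is active
                for j in range(dim):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (predicted interaction) or -1."""
    return 1 if score(w, b, x) > 0 else -1


# Toy data: three known positives, and four unlabeled points of which the
# last one sits among the positives (a possible unknown positive).
positives = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0)]
unlabeled = [(-2.0, -2.0), (-3.0, -2.0), (-2.0, -3.0), (2.5, 2.5)]

w, b = train_biased_svm(positives, unlabeled)
print("weights:", w, "bias:", b)
print("known positives:", [predict(w, b, x) for x in positives])
print("unlabeled:", [predict(w, b, x) for x in unlabeled])
```

Because misclassifying a known positive costs more than misclassifying an unlabeled example, the decision boundary is not forced to push every unlabeled point to the negative side, which is what allows unlabeled pairs that resemble known PRIs to be flagged as candidate interactions. In practice the two costs would be tuned on held-out data rather than fixed as they are here.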
