Evaluation of Protein-protein Interaction Predictors with Noisy Partially Labeled Data Sets

Protein-protein interaction (PPI) prediction is an important problem in machine learning and computational biology. However, no data set exists, for either training or evaluation, in which all instances are accurately labeled. What is available instead are instances of the positive class (with possibly noisy labels) and no instances of the negative class. The lack of negative-class data is typically handled with the observation that a randomly chosen protein pair has a nearly 100% chance of being negative, since only about 1 in 1,500 protein pairs is expected to interact. In this paper, we focus on the problem that the lack of accurately labeled test data sets for PPI prediction may lead to biased evaluation results. We first show that not acknowledging the inherent skew in the interactome (i.e., the rare occurrence of positive instances) leads to an over-estimated accuracy of the predictor. We then show that, even when positive interactions are treated as a rare category, using randomly sampled protein pairs (excluding known interacting pairs) as the negative test set can lead to an under-estimated evaluation result. We formalize both problems to validate these claims and, based on the formalization, propose a balancing method that cancels the over-estimation against the under-estimation. Finally, our experiments validate the theoretical analysis and show that this balanced evaluation can measure the exact performance without a gold-standard data set.
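The two biases described above can be illustrated with a small arithmetic sketch. This is not the paper's formalization, only a minimal example under stated assumptions: the function names are hypothetical, and the true-positive rate (0.9) and false-positive rate (0.05) are made-up values chosen for illustration. The only number taken from the text is the 1-in-1,500 interaction rate.

```python
def precision_at_ratio(tpr, fpr, negatives_per_positive):
    """Precision of a predictor with fixed TPR/FPR on a test set that
    contains `negatives_per_positive` negatives for every positive."""
    true_positives = tpr * 1.0
    false_positives = fpr * negatives_per_positive
    return true_positives / (true_positives + false_positives)


def measured_fpr(tpr, fpr, hidden_positive_rate):
    """False-positive rate observed when random pairs are all labeled
    negative although a fraction of them are in fact interacting pairs.
    Every correct hit on a hidden positive is counted as an error."""
    return hidden_positive_rate * tpr + (1.0 - hidden_positive_rate) * fpr


# Illustrative (assumed) predictor quality, not results from the paper.
TPR, FPR = 0.9, 0.05

# Over-estimation: a balanced 1:1 test set ignores the skew of the interactome.
balanced_precision = precision_at_ratio(TPR, FPR, 1)
# 1 in 1,500 pairs interacting means 1,499 negatives per positive.
skewed_precision = precision_at_ratio(TPR, FPR, 1499)

# Under-estimation: random "negatives" still hide about 1 in 1,500 positives
# (simplification: removing known interactions is assumed not to change this).
observed_fpr = measured_fpr(TPR, FPR, 1.0 / 1500.0)

print(f"precision on balanced test set: {balanced_precision:.4f}")
print(f"precision at 1:1499 skew:       {skewed_precision:.4f}")
print(f"true FPR: {FPR:.5f}, FPR measured on random negatives: {observed_fpr:.5f}")
```

With these assumed rates, precision on the balanced set is far higher than at the realistic class ratio, while the false-positive rate measured on random "negatives" is slightly higher than the true one. The two errors therefore pull in opposite directions, which is the property the balancing method relies on.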
