Bayesian Inference for Genomic Data Integration Reduces Misclassification Rate in Predicting Protein-Protein Interactions

Protein-protein interactions (PPIs) are essential to most fundamental cellular processes, and there is increasing interest in reconstructing PPI networks. However, several critical difficulties stand in the way of reliable predictions. Notably, false positive rates can exceed 80%. Correcting errors separately at each data source is both time-consuming and inefficient, because a single test can rarely cover the errors introduced at multiple levels of data processing. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), that lowers the misclassification rate (both false positives and false negatives) by automatically up-weighting the most informative data sources while down-weighting less informative and biased ones. Extensive studies indicate that NBEL is significantly more robust than classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set, NBEL predicts many more PPIs than naïve Bayes, which suggests that previous studies may contain large numbers of not only false positives but also false negatives. Validation on two high-quality human PPI data sets supports these observations. Our experiments demonstrate that it is feasible to predict PPIs computationally at high throughput with substantially fewer false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may encourage the use of computational approaches to correct data errors in general, and may speed up high-quality PPI prediction. Such reliable predictions may provide a solid platform for other studies, such as protein function prediction and the role of PPIs in disease susceptibility.
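To make the contrast between naïve Bayes integration and source weighting concrete, the following minimal sketch combines per-source evidence for one candidate protein pair in log-odds form. It is not the NBEL procedure: the weights are fixed by hand rather than inferred from the data, and every name and number (`combine`, the prior, the likelihoods, the weights) is a hypothetical value chosen only for illustration.

```python
import math


def log_likelihood_ratio(p_given_interacting, p_given_not):
    """Log of P(evidence | interacting) / P(evidence | not interacting)."""
    return math.log(p_given_interacting / p_given_not)


def combine(prior_prob, llrs, weights=None):
    """Posterior probability of interaction from per-source log likelihood ratios.

    With weights=None every source counts fully, which is the naive Bayes
    rule under conditional independence. Hand-set weights in [0, 1] shrink
    the contribution of sources assumed to be unreliable (illustrative only).
    """
    if weights is None:
        weights = [1.0] * len(llrs)
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    log_odds += sum(w * llr for w, llr in zip(weights, llrs))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence for one protein pair from three data sources.
# The third source is assumed to be contaminated and argues strongly
# against an interaction.
llrs = [
    log_likelihood_ratio(0.8, 0.1),
    log_likelihood_ratio(0.6, 0.2),
    log_likelihood_ratio(0.1, 0.9),
]
prior = 0.01  # assumed prior probability that a random pair interacts

naive = combine(prior, llrs)
weighted = combine(prior, llrs, weights=[1.0, 1.0, 0.1])
print(f"naive Bayes posterior: {naive:.4f}")
print(f"down-weighted posterior: {weighted:.4f}")
```

In this toy setting the down-weighted posterior is higher than the naïve Bayes posterior, because the contaminated source can no longer cancel the agreement of the other two. This is the kind of false negative that the abstract attributes to naïve Bayes; how NBEL actually determines source weights is not described in the text above.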
