Predicting co-complexed protein pairs using genomic and proteomic data integration

Background: Identifying all protein-protein interactions in an organism is a major objective of proteomics. A related goal is to know which protein pairs are present in the same protein complex. High-throughput methods such as yeast two-hybrid (Y2H) and affinity purification coupled with mass spectrometry (APMS) have been used to detect interacting proteins on a genomic scale. However, both Y2H and APMS methods have substantial false-positive rates. Aside from high-throughput interaction screens, other gene- or protein-pair characteristics may also be informative of physical interaction. It is therefore desirable to integrate multiple datasets and exploit their differing predictive value to predict co-complexed relationships more accurately.

Results: Using a supervised machine learning approach, the probabilistic decision tree, we integrated high-throughput protein interaction datasets and other gene- and protein-pair characteristics to predict co-complexed pairs (CCPs) of proteins. Our predictions proved more sensitive and specific than predictions based on Y2H or APMS methods, alone or in combination. Among the top predictions not annotated as CCPs in our reference set (obtained from the MIPS complex catalogue), a significant fraction was found to physically interact according to a separate database (YPD, the Yeast Proteome Database), and the remaining predictions may represent unknown CCPs.

Conclusions: We demonstrated that the probabilistic decision tree approach can be used successfully to predict co-complexed protein (CCP) pairs from other characteristics. Our top-scoring CCP predictions provide testable hypotheses for experimental validation.
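To make the idea of a probabilistic decision tree concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary evidence for each protein pair and a Gini-based splitting rule, neither of which is specified in the abstract. The feature names (`y2h`, `apms`, `coexpr`), the toy training pairs, and the helper functions are all hypothetical. Each leaf stores a Laplace-smoothed estimate of the probability that a pair is co-complexed, so pairs can be ranked by score rather than merely classified.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, features, min_size=2):
    """Grow a tree over binary features; leaves hold a smoothed P(co-complexed)."""
    parent = gini(labels)
    best = None  # (weighted impurity, feature name)
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        if len(left) < min_size or len(right) < min_size:
            continue  # skip splits that leave a branch with too few pairs
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < parent and (best is None or score < best[0]):
            best = (score, f)
    if best is None:
        # Leaf: Laplace-smoothed fraction of positive (co-complexed) pairs.
        return {"prob": (sum(labels) + 1) / (len(labels) + 2)}
    f = best[1]
    rest = [g for g in features if g != f]
    split = {0: ([], []), 1: ([], [])}
    for r, y in zip(rows, labels):
        split[r[f]][0].append(r)
        split[r[f]][1].append(y)
    return {
        "feature": f,
        0: build_tree(split[0][0], split[0][1], rest, min_size),
        1: build_tree(split[1][0], split[1][1], rest, min_size),
    }


def predict_proba(tree, row):
    """Walk from the root to a leaf and return its stored probability."""
    while "prob" not in tree:
        tree = tree[row[tree["feature"]]]
    return tree["prob"]


# Hypothetical toy data: one dict of binary evidence per protein pair.
FEATURES = ["y2h", "apms", "coexpr"]
rows = [
    {"y2h": 1, "apms": 1, "coexpr": 1},
    {"y2h": 0, "apms": 1, "coexpr": 1},
    {"y2h": 1, "apms": 1, "coexpr": 0},
    {"y2h": 0, "apms": 1, "coexpr": 0},
    {"y2h": 1, "apms": 0, "coexpr": 1},
    {"y2h": 0, "apms": 0, "coexpr": 0},
    {"y2h": 0, "apms": 0, "coexpr": 1},
    {"y2h": 1, "apms": 0, "coexpr": 0},
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = annotated as co-complexed (made up)

tree = build_tree(rows, labels, FEATURES)
score_high = predict_proba(tree, {"y2h": 1, "apms": 1, "coexpr": 0})
score_low = predict_proba(tree, {"y2h": 1, "apms": 0, "coexpr": 1})
print(tree["feature"], score_high, score_low)
```

In this toy setting the tree combines several evidence types into one score per pair, which is the sense in which integrated predictions can outrank those drawn from a single screen.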
