Co-complex protein membership evaluation using Maximum Entropy on GO ontology and InterPro annotation

Abstract

Motivation: Protein–protein interactions (PPI) play a crucial role in our understanding of protein function and biological processes. Experimental findings are increasingly standardized and recorded in ontologies, with the Gene Ontology (GO) being one of the most successful projects. Several PPI evaluation algorithms have been based on the application of probabilistic frameworks or machine learning algorithms to GO properties. Here, we introduce a new training set design and a machine learning based approach that combines dependent heterogeneous protein annotations from the entire ontology to evaluate putative co-complex protein interactions determined by empirical studies.

Results: PPI annotations are built combinatorially from the corresponding GO terms and InterPro annotations. We use a high-confidence S. cerevisiae complex dataset as a positive training set. A series of classifiers based on Maximum Entropy and support vector machines (SVMs), each with a composite counterpart algorithm, are trained on a series of training sets. These achieve an area under the ROC curve of up to 0.97, outperforming go2ppi, a previously established tool for predicting PPI from GO annotations.

Availability and implementation: https://github.com/ima23/maxent-ppi

Supplementary information: Supplementary data are available at Bioinformatics online.
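As a rough illustration of how pair annotations might be built combinatorially from the annotations of two proteins, the following minimal Python sketch forms every cross-combination of terms and makes each one order-independent. The abstract does not specify the encoding, so the function name `pair_features`, the `|` separator, and the example annotation sets are assumptions for illustration only, not the authors' implementation.

```python
from itertools import product


def pair_features(annots_a, annots_b):
    """Build order-independent annotation-pair features for a protein pair.

    Hypothetical helper: the encoding is an assumption, not the
    published method.
    """
    features = set()
    for term_a, term_b in product(annots_a, annots_b):
        # Sort the two terms so (A, B) and (B, A) give the same feature.
        first, second = sorted((term_a, term_b))
        features.add(f"{first}|{second}")
    return features


# Placeholder annotation sets for two hypothetical proteins
# (GO and InterPro style identifiers, chosen only for illustration).
protein_x = {"GO:0005634", "IPR000001"}
protein_y = {"GO:0005634", "GO:0006412"}

feats = pair_features(protein_x, protein_y)
for feature in sorted(feats):
    print(feature)
```

Each resulting string could then serve as a binary feature for a classifier such as Maximum Entropy or an SVM. Sorting the terms within a pair is a design choice that keeps the feature set identical regardless of which protein is listed first.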
