Ensemble learning prediction of protein-protein interactions using protein functional annotations.

Protein-protein interactions (PPIs) are important for the majority of biological processes. A significant number of computational methods have been developed to predict PPIs using protein sequence, structural and genomic data. Vast experimental data is publicly available on the Internet, but it is scattered across numerous databases. This fact motivated us to create and evaluate new high-throughput datasets of interacting proteins. We extracted interaction data from the DIP, MINT, BioGRID and IntAct databases. We then constructed descriptive features for machine learning based on data from Gene Ontology and DOMINE. Four well-established machine learning methods (Support Vector Machine, Random Forest, Decision Tree and Naïve Bayes) were then applied to these datasets and combined into an Ensemble Learning method based on majority voting. In a cross-validation experiment, the Ensemble Learning method achieved a sensitivity above 80% and a classification accuracy of 90%. We extended the experiment to a bigger and more realistic dataset, on which sensitivity remained above 70%. These results confirm that our datasets are suitable for PPI prediction and that the Ensemble Learning method is well suited for this task. Both the processed PPI datasets and the software are available at .
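The majority-voting step and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the toy predictions, and the rule that a 2-2 tie among four classifiers counts as "non-interacting" (label 0) are all assumptions made for illustration, since the abstract does not say how ties are resolved.

```python
from collections import Counter

def majority_vote(votes, tie_label=0):
    """Return the most common label; fall back to tie_label on a tie (assumed rule)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]

def ensemble_predict(per_classifier_predictions, tie_label=0):
    """Combine per-classifier label lists (same length) into one prediction per pair."""
    return [majority_vote(votes, tie_label)
            for votes in zip(*per_classifier_predictions)]

def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with 1 = interacting pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(y_true, y_pred):
    """Fraction of protein pairs labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical outputs of the four base classifiers on six protein pairs.
svm = [1, 0, 1, 1, 0, 0]
rf = [1, 0, 0, 1, 0, 1]
dt = [1, 1, 0, 1, 0, 0]
nb = [0, 0, 1, 1, 1, 0]
y_true = [1, 0, 1, 1, 0, 0]

y_pred = ensemble_predict([svm, rf, dt, nb])
print(y_pred, sensitivity(y_true, y_pred), accuracy(y_true, y_pred))
```

With an even number of voters the tie rule matters: sending ties to the negative class, as assumed here, trades sensitivity for fewer false positives, and the opposite choice would do the reverse.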
