Paper information - Machine learning for the prediction of protein-protein interactions

Machine learning for the prediction of protein-protein interactions

The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in bioinformatics and systems biology, because most essential cellular processes are mediated by these interactions. In this thesis we focus on the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs that are members of the same protein complex. Although high-throughput methods for the direct identification of PPI have been developed in recent years, it has been demonstrated that the data obtained by these methods are often incomplete and suffer from high false-positive and false-negative rates. To deal with this technology-driven problem, several machine learning techniques have been employed to improve the accuracy and reliability of predicted interacting protein pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has commonly been viewed as a binary classification problem. However, the nature of the data creates two major problems. Firstly, the classes are imbalanced, because the number of positive examples (pairs of proteins which really interact) is much smaller than the number of negative ones. Secondly, the selection of negative examples is based on unreliable assumptions, which could introduce bias into the classification results. The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods for PPI prediction. OCC methods use examples of just one class to generate a predictive model, which is consequently independent of the kind of negative examples selected; additionally, these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task.
We also undertook a comparative performance evaluation with several conventional learning techniques. Furthermore, we pay attention to a new potential drawback that appears to affect the performance of PPI prediction. It is associated with the composition of the positive gold standard set, which contains a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in an over-optimistic performance estimate. The prediction of non-ribosomal PPI is a much more difficult task. We investigate strategies to improve performance on this subtask, integrating new kinds of data as well as combining diverse classification models generated from different sets of data. We also undertook a preliminary validation study of the new PPI predicted by OCC methods. To achieve this, we focus on three main aspects: searching the literature for biological evidence that supports the new predictions; analysing the properties of the predicted PPI networks; and identifying highly interconnected groups of proteins that can be associated with new protein complexes. Finally, this thesis explores a slightly different area, related to the prediction of PPI types. This is associated with the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) database according to their function and binding affinity. Considering the relatively small number of crystallized protein complexes available, it is not possible at the moment to link these results with the ones obtained previously for the prediction of PPI complexes. However, this could become possible in the near future, when more PPI structures are available.
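The one-class idea summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis's method: it is a simple centroid-and-radius data description, chosen here only because it needs nothing beyond the Python standard library. The function names, the two-feature layout and the `quantile` parameter are all assumptions made for illustration. The model is fitted on positive (interacting) pairs only, so no negative examples have to be selected.

```python
import math
import random


def fit_centroid_occ(positives, quantile=0.95):
    """Fit a one-class model using positive feature vectors only.

    Returns (centroid, radius). The radius is chosen so that roughly
    `quantile` of the training positives fall inside it.
    """
    dim = len(positives[0])
    centroid = [sum(v[i] for v in positives) / len(positives) for i in range(dim)]
    dists = sorted(math.dist(v, centroid) for v in positives)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]


def predict(model, vector):
    """Return True if the vector lies inside the learned description."""
    centroid, radius = model
    return math.dist(vector, centroid) <= radius


if __name__ == "__main__":
    # Hypothetical features for a protein pair, e.g. an expression
    # correlation and an annotation similarity score (illustrative only).
    rng = random.Random(0)
    positives = [[rng.gauss(0.8, 0.05), rng.gauss(0.7, 0.05)] for _ in range(200)]
    model = fit_centroid_occ(positives)
    print(predict(model, [0.8, 0.7]))   # candidate close to the known positives
    print(predict(model, [0.1, 0.1]))   # candidate far from the known positives
```

Because the decision boundary depends only on the positive set, it does not change if the pool of assumed non-interacting pairs changes, which is the property the abstract attributes to OCC methods.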

José Antonio Reyes | J. A. Reyes
