A MapReduce based parallel SVM for large-scale predicting protein-protein interactions

Abstract Protein–protein interactions (PPIs) are crucial to most biochemical processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. Although high-throughput experimental techniques have generated a large amount of PPI data for different species, the number of known interactions is still small compared to the total number of possible PPIs. Furthermore, experimental methods for identifying PPIs are both time-consuming and expensive. Therefore, it is urgent and challenging to develop automated computational methods that predict PPIs efficiently and accurately. In this article, we propose a novel MapReduce-based parallel SVM model for large-scale prediction of protein–protein interactions using only protein sequence information. First, local sequential features, represented by an autocorrelation descriptor, are extracted from protein sequences. Then the MapReduce framework is employed to train support vector machine (SVM) classifiers in a distributed way, which significantly reduces training time while maintaining a high level of accuracy. The experimental results demonstrate that the proposed parallel algorithms can not only handle large-scale PPI datasets but also perform well in terms of speedup and accuracy. Consequently, the proposed approach can be considered a promising and powerful new tool for large-scale PPI prediction, offering excellent performance in less time.
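
The distributed training step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how partitions are trained or merged, so the sketch assumes a simple partition, train, merge scheme in which each mapper trains a sub-model and emits the samples closest to its decision boundary (a rough stand-in for support vectors), and a reducer merges them for a final training pass. The toy linear SVM, the `keep` parameter, all function names, and the 2-D data are hypothetical and chosen only to keep the example self-contained.

```python
from functools import reduce


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(samples, epochs=300, lam=0.01, lr=0.1):
    """Toy linear SVM (assumption: stands in for a real solver).

    Batch subgradient descent on the L2-regularised hinge loss.
    `samples` is a list of (feature_vector, label) with label in {+1, -1}.
    """
    dim, n = len(samples[0][0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in samples:
            if y * (dot(w, x) + b) < 1.0:  # margin violated: hinge term is active
                grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def map_partition(partition, keep=2):
    """Map: train on one partition and emit, per class, the `keep` samples
    with the smallest functional margin (approximate support vectors)."""
    w, b = train_linear_svm(partition)
    emitted = []
    for label in (1, -1):
        same_class = [s for s in partition if s[1] == label]
        same_class.sort(key=lambda s: s[1] * (dot(w, s[0]) + b))
        emitted.extend(same_class[:keep])
    return emitted


def reduce_candidates(left, right):
    """Reduce: merge the candidate samples emitted by two mappers."""
    return left + right


def predict(model, x):
    w, b = model
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical toy data: +1 = interacting pair, -1 = non-interacting pair.
partitions = [
    [([2.0, 2.5], 1), ([3.0, 2.0], 1), ([1.5, 1.0], 1),
     ([-2.0, -2.5], -1), ([-3.0, -2.0], -1), ([-1.0, -1.5], -1)],
    [([2.5, 3.0], 1), ([1.0, 2.0], 1), ([3.0, 3.0], 1),
     ([-2.5, -3.0], -1), ([-1.0, -2.0], -1), ([-3.0, -3.0], -1)],
]

# Mappers run independently per partition; here they run sequentially.
candidates = reduce(reduce_candidates, map(map_partition, partitions))
model = train_linear_svm(candidates)  # final pass on the merged candidates
```

In a real cluster each `map_partition` call would run on a separate node, so the expensive training happens on small subsets in parallel and only the final pass sees the merged candidates. The feature vectors would come from the autocorrelation descriptor rather than hand-written 2-D points.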
