Efficiently predicting large-scale protein-protein interactions using MapReduce

With a rapid development of high-throughput genomic technologies, a vast amount of protein-protein interactions (PPIs) data has been generated for difference species. However, such set of PPIs is rather small when compared with all possible PPIs. Hence, there is a necessity to specifically develop computational algorithms for large-scale PPI prediction. In response to this need, we propose a parallel algorithm, namely pVLASPD, to perform the prediction task in a distributed manner. In particular, pVLASPD was modified based on the VLASPD algorithm for the purpose of improving the efficiency of VLASPD while maintaining a comparable effectiveness. To do so, we first analyzed VLASPD step by step to identify the places that caused the bottlenecks of efficiency. After that, pVLASPD was developed by parallelizing those inefficient places with the framework of MapReduce. The extensive experimental results demonstrate the promising performance of pVLASPD when applied to prediction of large-scale PPIs.

[1]  Keith C. C. Chan,et al.  Discovering Variable-Length Patterns in Protein Sequences for Protein-Protein Interaction Prediction , 2015, IEEE Transactions on NanoBioscience.

[2]  Yungki Park,et al.  Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences , 2009, BMC Bioinformatics.

[3]  Reza Ebrahimpour,et al.  PPIevo: protein-protein interaction prediction from PSSM based evolutionary information. , 2013, Genomics.

[4]  Adam J. Smith,et al.  The Database of Interacting Proteins: 2004 update , 2004, Nucleic Acids Res..

[5]  T. Pawson,et al.  Assembly of Cell Regulatory Systems Through Protein Interaction Domains , 2003, Science.

[6]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[7]  Sandhya Rani,et al.  Human Protein Reference Database—2009 update , 2008, Nucleic Acids Res..

[8]  Jean-Loup Faulon,et al.  Predicting protein-protein interactions using signature products , 2005, Bioinform..

[9]  David A. Gough,et al.  Predicting protein-protein interactions from primary structure , 2001, Bioinform..

[10]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[11]  Albert Chan,et al.  PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs , 2006, BMC Bioinformatics.

[12]  Shuai Li,et al.  A MapReduce based parallel SVM for large-scale predicting protein-protein interactions , 2014, Neurocomputing.

[13]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .