Towards Data Analytics of Pathogen-Host Protein-Protein Interaction: A Survey

Big data now pervades many disciplines, including computer vision, economics, online resources, and bioinformatics. A growing body of research applies data mining and machine learning to uncover and predict domain knowledge in these fields. Protein-protein interaction is one of the main research areas in bioinformatics because it underlies biological function. However, most pathogen-host protein-protein interactions (PHPPI), which could reveal much more about the infection mechanisms between pathogen and host, still await further investigation. There is currently no well-structured database offering a sound feature representation of PHPPI for research purposes, not even for studies of infection mechanisms across different pathogen species. In this paper, we survey PHPPI research and construct a public PHPPI dataset for future work. The resulting dataset is very large, highly imbalanced, and high-dimensional. We also discuss several machine learning methodologies that suggest possible analytics solutions for the near future. This paper contributes to a new yet challenging research area: applying data analytics technologies in bioinformatics to learn and predict pathogen-host protein-protein interactions.
