Leveraging Stacked Denoising Autoencoder in Prediction of Pathogen-Host Protein-Protein Interactions

Proteomics is one of the most critical areas of big data research in bioinformatics. In this paper, we focus on protein-protein interactions, and in particular on pathogen-host protein-protein interactions (PHPPIs), which reveal critical molecular processes in biology. Conventionally, biologists identify these interactions with in-lab methods, including small-scale biochemical, biophysical, and genetic experiments as well as large-scale methods (e.g., yeast two-hybrid analysis). These in-lab methods are time-consuming and labor-intensive. Because interactions between proteins from different species play critical roles in both infectious diseases and drug design, the motivation behind this study is to provide biologists with a basic framework based on big data analytics and deep learning models. Our contribution is to leverage an unsupervised learning model, specifically stacked denoising autoencoders, to achieve more efficient prediction performance on PHPPIs. We further detail this framework for PHPPI research and curate a large, imbalanced PHPPI dataset. On this dataset, our model built on unsupervised learning demonstrates better results.
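To make the central idea concrete, the following is a minimal NumPy sketch of greedy layer-wise training of stacked denoising autoencoders. It is an illustration only, not the implementation used in this paper: the masking noise, sigmoid units, squared-error loss, layer sizes, learning rate, and the random toy matrix standing in for encoded protein-pair features are all assumptions, and the function names `train_dae_layer` and `stacked_encode` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_dae_layer(X, n_hidden, noise=0.3, lr=0.5, epochs=200, seed=0):
    """Train one denoising autoencoder layer; return encoder weights (W, b).

    Assumed setup: masking noise, sigmoid units, squared-error loss,
    full-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = X.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))      # encoder weights
    b = np.zeros(n_hidden)                                    # encoder bias
    W_out = rng.normal(0.0, 0.1, size=(n_hidden, n_visible))  # decoder weights
    c = np.zeros(n_visible)                                   # decoder bias
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction `noise` of entries.
        mask = rng.random(X.shape) >= noise
        X_noisy = X * mask
        # Encode the corrupted input, then reconstruct.
        H = sigmoid(X_noisy @ W + b)
        R = sigmoid(H @ W_out + c)
        # Backpropagate the squared error against the CLEAN input.
        dR = (R - X) * R * (1.0 - R) / n_samples
        dH = (dR @ W_out.T) * H * (1.0 - H)
        W_out -= lr * (H.T @ dR)
        c -= lr * dR.sum(axis=0)
        W -= lr * (X_noisy.T @ dH)
        b -= lr * dH.sum(axis=0)
    return W, b


def stacked_encode(X, layer_sizes, **train_kwargs):
    """Greedily train one layer at a time; each layer sees the previous codes."""
    feats, params = X, []
    for n_hidden in layer_sizes:
        W, b = train_dae_layer(feats, n_hidden, **train_kwargs)
        params.append((W, b))
        feats = sigmoid(feats @ W + b)  # clean (uncorrupted) encoding
    return feats, params


# Toy stand-in for protein-pair feature vectors (hypothetical data).
X_toy = np.random.default_rng(42).random((40, 20))
codes, params = stacked_encode(X_toy, layer_sizes=[8, 4])
print(codes.shape)
```

In a prediction setting, the learned codes would typically be passed to a supervised classifier trained on labeled interacting and non-interacting pairs; that step, and any handling of class imbalance, is omitted here.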
