APEX2S: A two‐layer machine learning model for discovery of host‐pathogen protein‐protein interactions on cloud‐based multiomics data

Faced with an avalanche of biological interaction data, computational biology confronts growing challenges in big data analysis and calls for more studies that mine and integrate cloud‐based multiomics data, especially data related to infectious diseases. Meanwhile, machine learning techniques have recently succeeded in a range of computational biology tasks. In this article, we focus on the study of host‐pathogen protein‐protein interactions, aiming to apply machine learning techniques to learn from interaction data and make predictions. We discuss a comprehensive and practical workflow for harnessing different cloud‐based multiomics data. In particular, we propose a novel two‐layer machine learning model, APEX2S, for the discovery of protein‐protein interactions. The results show that our model can better learn and predict from the accumulated host‐pathogen protein‐protein interactions.
