DELPHI: accurate deep ensemble model for protein interaction sites prediction.

MOTIVATION: Proteins usually perform their functions by interacting with other proteins, which is why accurately predicting protein-protein interaction (PPI) binding sites is a fundamental problem. Experimental methods are slow and expensive, so considerable effort is going into improving the performance of computational methods.

RESULTS: We propose DELPHI (DEep Learning Prediction of Highly probable protein Interaction sites), a new sequence-based deep learning suite for PPI binding site prediction. DELPHI has an ensemble structure that combines a CNN and an RNN component with a fine-tuning technique. Three novel features (HSP, position information, and ProtVec) are used in addition to nine existing ones. We comprehensively compare DELPHI to nine state-of-the-art programs on five datasets; DELPHI outperforms the competing methods in all metrics, even though its training dataset shares the least similarity with the testing datasets. In the most important metrics, AUPRC (area under the precision-recall curve) and MCC (Matthews correlation coefficient), it surpasses the second-best programs by as much as 18.5% and 27.7%, respectively. We also demonstrate that the improvement is essentially due to the ensemble model and, especially, the three new features. Using DELPHI, we show that there is a strong correlation between protein-binding residues (PBRs) and sites with strong evolutionary conservation. In addition, DELPHI's predicted PBR sites closely match known data from Pfam. DELPHI is available as open-source standalone software and as a web server.

AVAILABILITY: The DELPHI web server can be found at www.csd.uwo.ca/~yli922/index.php, together with all datasets and results from this study. The trained models, the DELPHI standalone source code, and the feature computation pipeline are freely available at github.com/lucian-ilie/DELPHI.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
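For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute MCC and a step-wise estimate of AUPRC (average precision) for per-residue binding predictions. It is an illustration only, not DELPHI's evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for the example, and the average-precision routine assumes there are no tied scores.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def average_precision(y_true, scores):
    """Step-wise AUPRC estimate: mean of the precision at each true positive,
    taken in order of decreasing score (assumes no tied scores)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Hypothetical per-residue labels (1 = binding) and predicted propensities.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
binary = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(round(mcc(labels, binary), 3))
print(round(average_precision(labels, scores), 3))
```

Both metrics are suited to imbalanced data such as binding-site prediction, where binding residues are a small minority: MCC uses all four confusion-matrix counts, and AUPRC summarises precision over all recall levels without fixing a threshold.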
