Cracking the black box of deep sequence-based protein–protein interaction prediction

Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities, and node degree information, and compared them to basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performance. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test sets, performance drops to random level. Moreover, baseline models that directly leverage sequence similarity and network topology show good performance at a fraction of the computational cost. We therefore advocate that future improvements be reported relative to baseline methods. Our findings suggest that predicting protein-protein interactions remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the “dark” protein interactome and better computational methods are needed.
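
As a toy illustration of the leakage and node-degree effects described above (a minimal sketch, not the pipeline used in the study), the following Python snippet splits a small interaction list at random, measures how many test proteins also occur in the training set, and scores candidate pairs using nothing but training-set node degrees. The function names `random_pair_split`, `protein_overlap`, and `degree_scores` are hypothetical, and the hub-shaped network in the usage note is made up for demonstration.

```python
import random
from collections import Counter


def random_pair_split(pairs, test_fraction=0.2, seed=0):
    """Randomly split interaction pairs (not proteins) into train and test."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def protein_overlap(train, test):
    """Fraction of test proteins that also occur in at least one training pair."""
    train_proteins = {p for pair in train for p in pair}
    test_proteins = {p for pair in test for p in pair}
    if not test_proteins:
        return 0.0
    return len(train_proteins & test_proteins) / len(test_proteins)


def degree_scores(train_positives, candidates):
    """Score each candidate pair by the summed training-set degree of its proteins.

    This baseline never looks at a sequence: it only counts how often each
    protein appears among the known training interactions.
    """
    degree = Counter(p for pair in train_positives for p in pair)
    return [degree[a] + degree[b] for a, b in candidates]
```

For a hypothetical hub network `[("HUB", f"P{i}") for i in range(10)]`, any random pair split places `HUB` in both the training and the test set, so a degree-only score ranks every held-out `HUB` pair above a pair of unseen proteins without learning anything about binding. Splitting by protein, or by sequence-similarity cluster, rather than by pair is what removes this shortcut.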
