DIPS-Plus: The enhanced database of interacting protein structures for interface prediction

How and where proteins interface with one another can ultimately impact the proteins’ functions along with a range of other biological processes. As such, precise computational methods for protein interface prediction (PIP) are highly sought after, as they could yield significant advances in drug discovery and design as well as protein function analysis. However, the traditional benchmark dataset for this task, Docking Benchmark 5 (DB5) [1], contains only 230 complexes for training, validating, and testing different machine learning algorithms. In this work, we expand on a dataset recently introduced for this task, the Database of Interacting Protein Structures (DIPS) [2, 3], to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces. The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex, whereas DIPS-Plus adds new residue-level features, including protrusion indices, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid, giving researchers a large, well-curated feature bank for training protein interface prediction methods. We demonstrate through rigorous benchmarks that training an existing state-of-the-art (SOTA) model for PIP on DIPS-Plus yields SOTA results, surpassing the performance of all other models trained on residue-level and atom-level encodings of protein complexes to date.

[1] D. Haussler, et al. Hidden Markov models in computational biology. Applications to protein modeling, 1993, Journal of molecular biology.

[2] Alexandre Tkatchenko, et al. Quantum-chemical insights from deep tensor neural networks, 2016, Nature Communications.

[3] K. Mizuguchi, et al. Partner-Aware Prediction of Interacting Residues in Protein-Protein Complexes from Sequence Data, 2011, PloS one.

[4] A. Ben-Hur, et al. PAIRpred: Partner‐specific prediction of interacting residues from sequence and structure, 2014, Proteins.

[5] M. Šikić, et al. PSAIA – Protein Structure and Interaction Analyzer, 2008, BMC Structural Biology.

[6] E Siva Sankari, et al. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets, 2017, Journal of theoretical biology.

[7] Rishi Bedi, et al. End-to-End Learning on 3D Protein Structure for Interface Prediction, 2019, NeurIPS.

[8] Jie Li, et al. PDB-wide collection of binding data: current status of the PDBbind database, 2015, Bioinform.

[9] Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics, 2009, Bioinform.

[10] Kenji Mizuguchi, et al. Network analysis and in silico prediction of protein-protein interactions with applications in drug discovery, 2017, Current opinion in structural biology.

[11] T. Hamelryck. An amino acid has two sides: A new 2D measure provides a different view of solvent exposure, 2005, Proteins.

[12] Alex Smola, et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs, 2019, ArXiv.

[13] Bin Liu, et al. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks, 2019, Briefings Bioinform.

[14] B. Rost, et al. Conservation and prediction of solvent accessibility in protein families, 1994, Proteins.

[15] Shuiwang Ji, et al. Deep Learning of High-Order Interactions for Protein Interface Prediction, 2020, KDD.

[16] Joan Bruna, et al. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, 2021, ArXiv.

[17] K. Jarrod Millman, et al. Array programming with NumPy, 2020, Nat.

[18] D. Pal, et al. Main-chain conformational features at different conformations of the side-chains in proteins, 1998, Protein engineering.

[19] Myle Ott, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2019, Proceedings of the National Academy of Sciences.

[20] Alán Aspuru-Guzik, et al. Convolutional Networks on Graphs for Learning Molecular Fingerprints, 2015, NIPS.

[21] M. Bronstein, et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning, 2019, Nature Methods.

[22] Sean R. Eddy, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.

[23] Sameer Velankar, et al. Worldwide Protein Data Bank validation information: usage and trends, 2018, Acta crystallographica. Section D, Structural biology.

[24] Vasant Honavar, et al. Predicting protein-protein interface residues using local surface structural similarity, 2012, BMC Bioinformatics.

[25] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[26] Alex Fout, et al. Protein Interface Prediction using Graph Convolutional Networks, 2017, NIPS.

[27] José María Carazo, et al. BIPSPI: a method for the prediction of partner-specific protein–protein interfaces, 2018, Bioinform.

[28] Yang Zhang, et al. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment, 2013, Bioinform.

[29] Tom Sercu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021, Proceedings of the National Academy of Sciences.

[30] Michal Linial, et al. Using Bayesian Networks to Analyze Expression Data, 2000, J. Comput. Biol.

[31] Chris Bailey-Kellogg, et al. Protein interaction interface region prediction by geometric deep learning, 2021, Bioinform.

[32] Alexandre M J J Bonvin, et al. Flexible protein-protein docking, 2006, Current opinion in structural biology.

[33] Milot Mirdita, et al. HH-suite3 for fast remote homology detection and deep protein annotation, 2019, BMC Bioinformatics.

[34] Kristian Vlahovicek, et al. Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests, 2009, PLoS Comput. Biol.

[35] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[36] Christopher L. McClendon, et al. Reaching for high-hanging fruit in drug discovery at protein–protein interfaces, 2007, Nature.

[37] Raphael A. G. Chaleil, et al. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2, 2015, Journal of molecular biology.

[38] Michael M. McKerns, et al. Building a Framework for Predictive Science, 2012, SciPy.

[39] Jinyan Li, et al. Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information, 2010, BMC Bioinformatics.

[40] John F. Canny, et al. MSA Transformer, 2021, bioRxiv.

[41] Zbigniew Dauter, et al. The quality and validation of structures from structural genomics, 2014, Methods in molecular biology.

[42] M. Sanner, et al. Reduced surface: an efficient way to compute molecular surfaces, 1996, Biopolymers.

[44] Achim Tresch, et al. Modeling the temporal interplay of molecular signaling and gene expression by using dynamic nested effects models, 2009, Proceedings of the National Academy of Sciences.

[45] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[46] T. Ioerger, et al. Correlations between secondary structure- and protein-protein interface-mimicry: the interface mimicry hypothesis, 2019, Organic and biomolecular chemistry.

[47] J. Söding, et al. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold, 2018, bioRxiv.

[48] David S. Goodsell, et al. The RCSB Protein Data Bank: redesigned web site and web services, 2010, Nucleic Acids Res.

[49] Raphael J. L. Townshend, et al. ATOM3D: Tasks On Molecules in Three Dimensions, 2020, NeurIPS Datasets and Benchmarks.

[50] Jianlin Cheng, et al. DNSS2: improved ab initio protein secondary structure prediction using advanced deep learning architectures, 2019, bioRxiv.

[51] Vasant Honavar, et al. Characterization of Protein–Protein Interfaces, 2008, The protein journal.

[52] Gert Vriend, et al. A series of PDB related databases for everyday needs, 2010, Nucleic Acids Res.