Neural Embeddings for Protein Graphs

Proteins perform much of the work in living organisms, so efficient computational methods for representing them are essential for large-scale biological research. Most current approaches struggle to integrate the information contained in both protein sequence and protein structure efficiently. In this paper, we propose a novel framework for embedding protein graphs in geometric vector spaces by learning an encoder function that preserves the structural distance between protein graphs. Using Graph Neural Networks (GNNs) and Large Language Models (LLMs), the framework generates structure- and sequence-aware protein representations. We demonstrate that our embeddings can be used to compare protein structures while providing a significant speed-up over traditional approaches based on structural alignment. In protein structure classification, the proposed method improves the average F1-score over other work by 26% on out-of-distribution (OOD) samples and by 32% on samples drawn from the same distribution as the training data. Our approach has applications in drug prioritization, drug repurposing, and disease subtype analysis, among other areas.
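As a rough illustration of what "an encoder function that preserves the structural distance" could mean in practice, the sketch below computes a distance-preservation objective on toy data: the mean squared difference between pairwise embedding distances and target structural distances. The function names, the choice of Euclidean distance, and the squared-error form are assumptions made for illustration; the abstract does not specify the loss or the geometry actually used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_preservation_loss(embeddings, target_distances):
    """Mean squared gap between embedding distances and target distances.

    embeddings: list of vectors, one per protein graph (hypothetical encoder output).
    target_distances: square matrix of structural distances between the same
        proteins (assumed to come from some alignment-based measure).
    Averages over all unordered pairs (i, j) with i < j.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            gap = euclidean(embeddings[i], embeddings[j]) - target_distances[i][j]
            total += gap ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy example: two proteins embedded in 2-D whose target distance is 5.
toy_embeddings = [[0.0, 0.0], [3.0, 4.0]]
toy_targets = [[0.0, 5.0], [5.0, 0.0]]
loss = distance_preservation_loss(toy_embeddings, toy_targets)
```

An encoder trained to minimize an objective of this kind would let structural similarity be estimated from a cheap vector distance rather than a full structural alignment, which is the source of the speed-up the abstract describes.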
