MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization

Small molecules play a critical role in modulating biological systems. Knowledge of chemical–protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein, but they mostly fail when sequence homology between an unannotated protein and those with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning of unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignments (MSAs) to capture functional relationships between proteins without knowledge of their structure or function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical–protein interactions. In the benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms the state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining contribute critically to the success of DISAE. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein-coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
