Context-Aware Amino Acid Embedding Advances Analysis of TCR-Epitope Interactions

Accurate prediction of binding interactions between T cell receptors (TCRs) and host cells is fundamental to understanding the regulation of the adaptive immune system and to developing data-driven approaches for personalized immunotherapy. While several machine learning models have been developed for this prediction task, the question of how to embed TCR sequences specifically into numeric representations remains largely unexplored compared to protein sequences in general. Here, we investigate whether embedding models designed for protein sequences and the most widely used BLOSUM-based embedding techniques are suitable for TCR analysis. We also present our context-aware amino acid embedding models (catELMo), designed explicitly for TCR analysis and trained on 4M unlabeled TCR sequences with no supervision. We validate the effectiveness of catELMo in both supervised and unsupervised scenarios by stacking the simplest models on top of our learned embeddings. For the supervised task, we choose the binding affinity prediction problem for TCR and epitope sequences and demonstrate significant performance gains (at least 14% in AUC) over existing embedding models as well as state-of-the-art methods. We further show that our learned embeddings reduce annotation cost by more than 93% while achieving results comparable to state-of-the-art methods. In the unsupervised TCR clustering task, catELMo identifies TCR clusters that are more homogeneous and complete with respect to their binding epitopes. Altogether, catELMo, trained without any explicit supervision, interprets TCR sequences better and negates the need for complex deep neural network architectures.
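The clustering claim is stated in terms of homogeneity and completeness. As a minimal sketch, assuming the standard entropy-based definitions of these two scores (the abstract does not specify which definitions were used), they can be computed from cluster assignments and epitope labels as follows. The function names and the toy labels are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels), in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(targets, given):
    """Conditional entropy H(targets | given), in nats."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity(epitopes, clusters):
    """1.0 when every cluster contains TCRs of a single epitope."""
    h = _entropy(epitopes)
    return 1.0 if h == 0 else 1.0 - _conditional_entropy(epitopes, clusters) / h


def completeness(epitopes, clusters):
    """1.0 when all TCRs of an epitope fall into the same cluster."""
    h = _entropy(clusters)
    return 1.0 if h == 0 else 1.0 - _conditional_entropy(clusters, epitopes) / h


# Toy example with placeholder epitope names (not real data).
epitopes = ["epitope_A", "epitope_A", "epitope_B", "epitope_B"]
perfect = [0, 0, 1, 1]      # one cluster per epitope
fragmented = [0, 1, 2, 3]   # every TCR in its own cluster

print(homogeneity(epitopes, perfect), completeness(epitopes, perfect))
print(homogeneity(epitopes, fragmented), completeness(epitopes, fragmented))
```

In the fragmented case every cluster is pure, so homogeneity stays high while completeness drops. This is why the two scores are reported together.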
