Designing meaningful continuous representations of T cell receptor sequences with deep generative models

T cell receptor (TCR) antigen binding underlies a key mechanism of the adaptive immune response, yet the vast diversity of TCRs and the complexity of protein interactions limit our ability to build useful low-dimensional representations of TCRs. To address these limitations in TCR analysis, we develop TCR-VALID, a capacity-controlled disentangling variational autoencoder trained on a dataset of approximately 100 million TCR sequences. We design TCR-VALID so that its representations are low-dimensional, continuous, disentangled, and sufficiently informative to support high-quality de novo generation of TCR sequences. We thoroughly quantify these properties of the representations, providing a framework for future protein representation learning in low dimensions. The continuity of TCR-VALID representations allows fast and accurate TCR clustering, which we benchmark against other state-of-the-art TCR clustering tools and pre-trained language models.
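As an illustration of what "capacity-controlled" can mean for a variational autoencoder, the sketch below implements one common formulation from the disentangling-VAE literature: the reconstruction loss plus a penalty gamma * |KL - C|, where the target capacity C is raised gradually during training. This is an assumption for illustration only. The abstract does not state the exact objective used by TCR-VALID, and the function names, the linear schedule, and the parameters `gamma`, `c_max`, and `anneal_steps` are all hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I) for one sample."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def capacity_at(step, c_max, anneal_steps):
    """Linearly raise the target capacity C from 0 to c_max, then hold it.

    The linear schedule is an assumption made for this sketch.
    """
    return c_max * min(1.0, step / anneal_steps)


def capacity_controlled_loss(recon_nll, mus, log_vars, capacity, gamma):
    """Batch loss: mean reconstruction NLL + gamma * |mean KL - capacity|.

    recon_nll: per-sample reconstruction negative log-likelihoods.
    mus, log_vars: per-sample latent means and log-variances from the encoder.
    """
    n = len(recon_nll)
    mean_nll = sum(recon_nll) / n
    mean_kl = sum(gaussian_kl(m, lv) for m, lv in zip(mus, log_vars)) / n
    return mean_nll + gamma * abs(mean_kl - capacity)
```

In a training loop under these assumptions, `capacity_at(step, c_max, anneal_steps)` would be evaluated at each step and passed as `capacity`, so the penalty pushes the latent code toward carrying a fixed, slowly growing amount of information rather than simply minimizing the KL term.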
