Performance Evaluation of Viral Infection Diagnosis using T-Cell Receptor Sequence and Artificial Intelligence

The adaptive immune system expresses millions of different receptors that detect and fight pathogens encountered throughout life. Each receptor is encoded by a unique DNA sequence. High-throughput sequencing and analysis of immune cell receptor sequences present a unique opportunity to inform our understanding of immunological responses to infections and to evaluate vaccine efficacy. Even after an infection is eliminated, pathogen-specific immune cells and their receptor sequences remain at higher frequencies than before infection, and this increase in frequency prevents secondary infections. Because these sequences persist in the body, they may serve as a stable biomarker for diagnosing infections and evaluating vaccine efficacy. However, diagnosing infection status from sequence data requires thorough analysis of massive datasets at an accuracy beyond that of traditional statistical tests. Here we evaluate various machine learning and deep learning algorithms on the identification and diagnosis of specific viral infections or vaccination statuses, using publicly available mouse (monkeypox infection and smallpox vaccination) and human (cytomegalovirus serostatus) T-cell receptor sequencing datasets. Our extensive experiments suggest that this approach holds potential for effective screening of disease status, including for recently emerged viruses such as SARS-CoV-2, the cause of the ongoing pandemic.
