Feature selection using a one dimensional naïve Bayes’ classifier increases the accuracy of support vector machine classification of CDR3 repertoires

Motivation: Somatic DNA recombination, the hallmark of vertebrate adaptive immunity, has the potential to generate a vast diversity of antigen receptor sequences. How this diversity captures antigen specificity remains incompletely understood. In this study we use high throughput sequencing to compare the global changes in T cell receptor β chain complementarity determining region 3 (CDR3β) sequences following immunization with ovalbumin administered with complete Freund's adjuvant (CFA) or CFA alone.

Results: The CDR3β sequences were deconstructed into short stretches of overlapping contiguous amino acids. The motifs were ranked according to a one‐dimensional Bayesian classifier score comparing their frequency in the repertoires of the two immunization classes. The top ranking motifs were selected and used to create feature vectors which were used to train a support vector machine. The support vector machine achieved high classification scores in a leave‐one‐out validation test reaching >90% in some cases.

Summary: The study describes a novel two‐stage classification strategy combining a one‐dimensional Bayesian classifier with a support vector machine. Using this approach we demonstrate that the frequency of a small number of linear motifs three amino acids in length can accurately identify a CD4 T cell response to ovalbumin against a background response to the complex mixture of antigens which characterize Complete Freund's Adjuvant.

Availability and implementation: The sequence data is available at www.ncbi.nlm.nih.gov/sra/?term=SRP075893. The Decombinator package is available at github.com/innate2adaptive/Decombinator. The R package e1071 is available at the CRAN repository https://cran.r‐project.org/web/packages/e1071/index.html.

Contact: b.chain@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
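The first stage of the strategy described above (motif extraction, one-dimensional ranking, feature vector construction) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the CDR3 sequences are invented toy data, and the scoring rule is an assumption: each motif is scored by the training accuracy of a single-feature Gaussian classifier with equal class priors, which is one plausible reading of a "one-dimensional Bayesian classifier score".

```python
from collections import Counter
from math import log, pi
from statistics import mean, pvariance


def motif_frequencies(cdr3s, k=3):
    """Relative frequency of each overlapping k-mer in one repertoire."""
    counts = Counter(
        seq[i:i + k] for seq in cdr3s for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def gaussian_loglik(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)


def one_dim_bayes_score(values_a, values_b, eps=1e-9):
    """Assumed score: fraction of samples that a single-feature Gaussian
    classifier (equal priors) assigns to the correct class. Ties count
    as errors; eps keeps the variance strictly positive."""
    mu_a, var_a = mean(values_a), pvariance(values_a) + eps
    mu_b, var_b = mean(values_b), pvariance(values_b) + eps
    correct = sum(
        gaussian_loglik(x, mu_a, var_a) > gaussian_loglik(x, mu_b, var_b)
        for x in values_a
    )
    correct += sum(
        gaussian_loglik(x, mu_b, var_b) > gaussian_loglik(x, mu_a, var_a)
        for x in values_b
    )
    return correct / (len(values_a) + len(values_b))


def rank_motifs(reps_a, reps_b, k=3):
    """Return (motif, score) pairs, best score first, ties alphabetical."""
    freqs_a = [motif_frequencies(r, k) for r in reps_a]
    freqs_b = [motif_frequencies(r, k) for r in reps_b]
    motifs = set().union(*freqs_a, *freqs_b)
    scored = [
        (m, one_dim_bayes_score([f.get(m, 0.0) for f in freqs_a],
                                [f.get(m, 0.0) for f in freqs_b]))
        for m in motifs
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def feature_vector(cdr3s, motifs, k=3):
    """Frequencies of the selected motifs, in the order given."""
    freqs = motif_frequencies(cdr3s, k)
    return [freqs.get(m, 0.0) for m in motifs]


# Toy repertoires (invented sequences, two per class, for illustration only).
reps_ova = [["CASSLGQAY", "CASSLGTEVF"], ["CASSLGNTLY", "CASRLGQF"]]
reps_cfa = [["CASSPDRGAF", "CASSQDTQY"], ["CASGDRNERLF", "CASSPDWGY"]]

ranked = rank_motifs(reps_ova, reps_cfa)
top_motifs = [m for m, _ in ranked[:5]]
vectors = [feature_vector(r, top_motifs) for r in reps_ova + reps_cfa]
```

The resulting `vectors`, paired with class labels, are the kind of input the second stage would pass to a support vector machine such as the one in the e1071 package named above. In a leave‐one‐out test, the motif ranking would normally be recomputed on the training repertoires of each fold so that the held-out repertoire does not influence feature selection.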
