Motivation: Clustering of antigen-specific T cell receptor repertoire (TCRR) sequences is challenging. The recently published tool GLIPH aims to solve this problem. However, clustering large repertoires takes several days to weeks, making its use impractical in larger studies. In addition, the methodology used in GLIPH suffers from several shortcomings, including non-determinism, potential loss of significant antigen-specific sequences or inclusion of too many unspecific sequences.

Results: We present an algorithm for clustering TCRR sequences that scales efficiently to large repertoires. We clustered 26 real datasets with up to 62 000 unique CDR3β sequences using both GLIPH and an implementation of our method called ting. While GLIPH required multiple weeks, ting only needed about one hour for the same task. In addition, we found that in naïve repertoires, where no or very few antigen-specific CDR3 sequences or clusters should exist, our method indeed selects fewer sequences.

Availability: Our method has been implemented in Python as a tool called ting, using numpy and NetworkX. It is available on GitHub (https://github.com/FelixMoelder/ting) and on PyPI under the MIT license.

Contact: felix.moelder@uni-due.de or sven.rahmann@uni-due.de