Automatic generation of ground truth data for the evaluation of clonal grouping methods in B-cell populations

Motivation: The adaptive B-cell response is driven by the expansion, somatic hypermutation, and selection of B-cell clones. The number, size, and sequence diversity of these clones are essential characteristics of B-cell populations. Identifying clones in B-cell populations is central to several kinds of repertoire study, such as statistical analysis, repertoire comparison, and clonal tracking. Several clonal grouping methods have been developed to group sequences from B-cell immune repertoires. These methods have been evaluated mainly on simulated benchmarks, because experimental data containing clonally related sequences can be difficult to obtain. However, experimental data may contain multiple sources of sequence variability that are hard to reproduce artificially. The generation of high-precision ground truth data that preserves real repertoire distributions is therefore necessary to evaluate clonal grouping methods accurately.

Results: We propose a novel methodology to generate ground truth data sets from real repertoires. Our procedure requires V(D)J annotations to obtain the initial clones, and iteratively applies an optimisation step that moves sequences among clones to increase their cohesion and separation. We first showed that our method accurately identified clonally related sequences in simulated repertoires with higher mutation rates. Next, by comparing the performance of a widely used clonal grouping algorithm on several benchmarks generated by our method, we demonstrated that real benchmarks constitute a challenge for clonal grouping methods. Our method can be used to generate a large number of benchmarks and contribute to the construction of more accurate clonal grouping tools.

Availability and implementation: The source code and generated data sets are freely available at github.com/NikaAb/BCR_GTG
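
The two-stage procedure described in the Results (initial clones from V(D)J annotations, then iterative reassignment of sequences to improve cohesion and separation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, for illustration only, that initial clones are formed from sequences sharing the same V and J gene labels, that dissimilarity is a normalised Hamming distance on the CDR3 string, and that a sequence moves to whichever clone has the lowest mean distance to it. The function names, the dictionary keys (`v`, `j`, `cdr3`), and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Normalised Hamming distance; sequences of different length count as 1.0."""
    if len(a) != len(b) or not a:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def initial_clones(seqs):
    """Assumed initial grouping: sequences sharing V and J gene labels."""
    groups = defaultdict(list)
    for i, s in enumerate(seqs):
        groups[(s["v"], s["j"])].append(i)
    return {cid: members for cid, members in enumerate(groups.values())}


def mean_distance(i, members, seqs):
    """Mean CDR3 distance from sequence i to a set of members (1.0 if empty)."""
    if not members:
        return 1.0
    total = sum(hamming(seqs[i]["cdr3"], seqs[m]["cdr3"]) for m in members)
    return total / len(members)


def refine_clones(seqs, clones, max_iter=20):
    """Move each sequence to the clone with the lowest mean distance to it,
    repeating until no sequence moves or max_iter passes have been made."""
    clones = {cid: set(m) for cid, m in clones.items()}
    label = {i: cid for cid, m in clones.items() for i in m}
    for _ in range(max_iter):
        moved = False
        for i in range(len(seqs)):
            own = label[i]
            best_cid = own
            best_d = mean_distance(i, clones[own] - {i}, seqs)
            for cid, members in clones.items():
                if cid == own or not members:
                    continue
                d = mean_distance(i, members, seqs)
                if d < best_d:  # strict: ties keep the sequence in place
                    best_cid, best_d = cid, d
            if best_cid != own:
                clones[own].remove(i)
                clones[best_cid].add(i)
                label[i] = best_cid
                moved = True
        if not moved:
            break
    return [sorted(m) for m in clones.values() if m]


# Toy repertoire: sequence 3 carries a V label that puts it in the wrong group.
toy = [
    {"v": "V1", "j": "J1", "cdr3": "AAAAAAAA"},
    {"v": "V1", "j": "J1", "cdr3": "AAAAAAAT"},
    {"v": "V1", "j": "J1", "cdr3": "AAAATAAA"},
    {"v": "V2", "j": "J1", "cdr3": "AAAAAAAA"},
    {"v": "V2", "j": "J1", "cdr3": "CCCCGGGG"},
    {"v": "V2", "j": "J1", "cdr3": "CCCCGGGT"},
]
refined = refine_clones(toy, initial_clones(toy))
print(refined)
```

In this toy input, the annotation-based grouping places sequence 3 with sequences 4 and 5, and the refinement step moves it to the clone whose CDR3s it matches. A real pipeline would need a distance that handles length differences and a stopping rule based on a global cohesion and separation score rather than this greedy per-sequence test.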
