IMSEQ - a fast and error aware approach to immunogenetic sequence analysis

MOTIVATION Recombined T- and B-cell receptor repertoires are increasingly being studied using next generation sequencing (NGS) in order to interrogate the repertoire composition as well as changes in the distribution of receptor clones under different physiological and disease states. This type of analysis requires efficient and unambiguous clonotype assignment to a large number of NGS read sequences, including the identification of the incorporated V and J gene segments and the CDR3 sequence. Current tools have deficits with respect to performance, accuracy and documentation of their underlying algorithms and usage. RESULTS We present IMSEQ, a method to derive clonotype repertoires from NGS data with sophisticated routines for handling errors stemming from PCR and sequencing artefacts. The application can handle different kinds of input data originating from single- or paired-end sequencing in different configurations and is generic regarding the species and gene of interest. We have carefully evaluated our method with simulated and real world data and show that IMSEQ is superior to other tools with respect to its clonotyping as well as standalone error correction and runtime performance. AVAILABILITY AND IMPLEMENTATION IMSEQ was implemented in C++ using the SeqAn library for efficient sequence analysis. It is freely available under the GPLv2 open source license and can be downloaded at www.imtools.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online. CONTACT lkuchenb@inf.fu-berlin.de or peter.robinson@charite.de.

[1]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[2]  Yi Shi,et al.  TCRklass: A New K-String–Based Algorithm for Human and Mouse TCR Repertoire Characterization , 2015, The Journal of Immunology.

[3]  Mikhail Shugay,et al.  MiTCR: software for T-cell receptor sequencing data analysis , 2013, Nature Methods.

[4]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2005, RECOMB.

[5]  Stephen R. Quake,et al.  Genetic measurement of memory B-cell recall using antibody repertoire sequencing , 2013, Proceedings of the National Academy of Sciences.

[6]  Mark M. Davis,et al.  Human responses to influenza vaccination show seroconversion signatures and convergent antibody rearrangements. , 2014, Cell host & microbe.

[7]  Manuel Holtgrewe,et al.  Mason – A Read Simulator for Second Generation Sequencing Data , 2010 .

[8]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2006, J. Comput. Biol..

[9]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[10]  Marie-Paule Lefranc,et al.  IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis , 2008, Nucleic Acids Res..

[11]  John Shawe-Taylor,et al.  Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine , 2013, Bioinform..

[12]  P. Robinson,et al.  TCR Repertoire Analysis by Next Generation Sequencing Allows Complex Differential Diagnosis of T Cell–Related Pathology , 2013, American journal of transplantation : official journal of the American Society of Transplantation and the American Society of Transplant Surgeons.

[13]  Mikhail Shugay,et al.  Towards error-free profiling of immune repertoires , 2014, Nature Methods.

[14]  S. Tonegawa Somatic generation of antibody diversity , 1983, Nature.

[15]  Marie-Paule Lefranc,et al.  IMGT , the international ImMunoGeneTics information system , 2003 .

[16]  Jiajie Zhang,et al.  PEAR: a fast and accurate Illumina Paired-End reAd mergeR , 2013, Bioinform..

[17]  Kun-Mao Chao,et al.  Aligning two sequences within a specified diagonal band , 1992, Comput. Appl. Biosci..

[18]  David Wu,et al.  High-Throughput Sequencing Detects Minimal Residual Disease in Acute T Lymphoblastic Leukemia , 2012, Science Translational Medicine.

[19]  C. Janeway Immunobiology: The Immune System in Health and Disease , 1996 .

[20]  Ning Ma,et al.  IgBLAST: an immunoglobulin variable domain sequence analysis tool , 2013, Nucleic Acids Res..

[21]  C. Ponting,et al.  Sequencing depth and coverage: key considerations in genomic analyses , 2014, Nature Reviews Genetics.

[22]  Knut Reinert,et al.  SeqAn An efficient, generic C++ library for sequence analysis , 2008, BMC Bioinformatics.

[23]  A. Casrouge,et al.  A Direct Estimate of the Human αβ T Cell Receptor Diversity , 1999 .

[24]  Patrice Duroux,et al.  IMGT/HighV QUEST paradigm for T cell receptor IMGT clonotype diversity and next generation repertoire immunoprofiling , 2013, Nature Communications.