Reproducibility of Illumina platform deep sequencing errors allows accurate determination of DNA barcodes in cells

Background: Next generation sequencing (NGS) of amplified DNA is a powerful tool for describing genetic heterogeneity within cell populations. It can be used both to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, because it is difficult to distinguish rare true sequences from spurious sequences generated by PCR or sequencing errors. This issue applies, for instance, to cellular barcoding strategies, which aim to follow the number and type of offspring of single cells by supplying these cells with unique heritable DNA tags.

Results: Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences.

Conclusions: Application of our new method demonstrates its value in identifying true sequences amongst spurious sequences in biological data sets.
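The observation that a given sequencing error recurs at a roughly constant rate across samples sequenced in parallel can be illustrated with a small sketch. This is not the authors' published procedure, which the abstract does not specify. It is a minimal illustration under stated assumptions: a candidate barcode is treated as a possible error product of a more abundant "parent" barcode one mismatch away, and it is flagged as spurious when its read count relative to the parent is nearly the same in every sample. The function names, the Hamming distance limit of 1, and the coefficient-of-variation cutoff of 0.25 are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def ratio_is_constant(child, parent, counts, max_cv=0.25, min_samples=2):
    """True if child/parent read ratios are nearly constant across samples.

    counts maps sample name -> {barcode: read count}.
    max_cv is a hypothetical cutoff on the coefficient of variation.
    """
    ratios = []
    for sample_counts in counts.values():
        parent_reads = sample_counts.get(parent, 0)
        if parent_reads == 0:
            continue  # ratio undefined in samples without the parent
        ratios.append(sample_counts.get(child, 0) / parent_reads)
    if len(ratios) < min_samples:
        return False  # too few samples to judge reproducibility
    avg = mean(ratios)
    if avg == 0:
        return False
    return pstdev(ratios) / avg <= max_cv


def find_spurious(counts, max_dist=1, max_cv=0.25):
    """Return barcodes that look like reproducible errors of a larger barcode."""
    totals = {}
    for sample_counts in counts.values():
        for barcode, reads in sample_counts.items():
            totals[barcode] = totals.get(barcode, 0) + reads

    spurious = set()
    for child in totals:
        for parent in totals:
            if totals[parent] <= totals[child]:
                continue  # a parent must be more abundant overall
            if len(parent) != len(child):
                continue  # this sketch only handles substitutions
            if hamming(child, parent) > max_dist:
                continue
            if ratio_is_constant(child, parent, counts, max_cv):
                spurious.add(child)
                break
    return spurious


# Toy read counts for three samples sequenced in parallel (made-up numbers).
example_counts = {
    "s1": {"ACGT": 10000, "ACGA": 10, "ACGG": 500},
    "s2": {"ACGT": 20000, "ACGA": 21, "ACGG": 5},
    "s3": {"ACGT": 5000, "ACGA": 5, "ACGG": 900},
}
print(find_spurious(example_counts))
```

In the toy data, ACGA tracks ACGT at about one read per thousand in every sample, so it is flagged. ACGG is also one mismatch from ACGT, but its relative abundance varies widely between samples, which is the pattern expected of a genuine barcode, so it is kept. A fixed read threshold could not separate these two cases, because ACGG has fewer reads than ACGA in sample s2.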
