Indel-correcting DNA barcodes for high-throughput sequencing

Significance

Modern high-throughput biological assays study pooled populations of individual members by labeling each member with a unique DNA sequence called a “barcode.” DNA barcodes are frequently corrupted by DNA synthesis and sequencing errors, leading to significant data loss and incorrect data interpretation. Here, we describe an error correction strategy to improve the efficiency and statistical power of DNA barcodes. Our strategy accurately handles insertions and deletions (indels) in DNA barcodes, the most common type of error encountered during DNA synthesis and sequencing, resulting in order-of-magnitude increases in accuracy, efficiency, and signal-to-noise ratio. The accompanying software package makes deployment of these barcodes straightforward for the broader experimental scientist community.

Many large-scale, high-throughput experiments use DNA barcodes, short DNA sequences prepended to DNA libraries, for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely used error-correcting codes borrowed from computer science (e.g., Hamming, Levenshtein codes) do not properly account for insertions and deletions (indels) in DNA barcodes, even though deletions are the most common type of synthesis error. Here, we present and experimentally validate filled/truncated right end edit (FREE) barcodes, which correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced guanine-cytosine (GC) content, minimal homopolymer runs, and reduced internal hairpin propensity.
We generate and include lists of barcodes with different lengths and error correction levels that may be useful in diverse high-throughput applications, including >10^6 single-error–correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space combinatorially, generating lists with >10^15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
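
The two ideas in the abstract that lend themselves to a small illustration are the sequence-design filters (GC content, homopolymer runs) and the reason indels defeat position-by-position comparison. The Python sketch below is only an illustration under stated assumptions: the function names, the thresholds (40 to 60% GC, runs of at most 2), and the example sequences are hypothetical and not taken from the paper or its software, and plain Levenshtein distance is used as a stand-in, not the FREE distance itself.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def passes_design_filters(seq: str, gc_min: float = 0.4,
                          gc_max: float = 0.6, max_run: int = 2) -> bool:
    """Candidate filter; the thresholds are illustrative assumptions."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


def hamming(a: str, b: str) -> int:
    """Count mismatched positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical 10-mer barcode followed by one downstream base, "T".
barcode = "ACGTACGTAC"
# Deleting the first base shifts the fixed-length read window by one,
# so the observed 10-mer ends with the downstream base.
observed = barcode[1:] + "T"

print(passes_design_filters(barcode))   # design filters on the barcode
print(hamming(barcode, observed))       # every position is shifted
print(levenshtein(barcode, observed))   # one deletion plus one fill-in base
```

In this example a single deletion changes every position of the fixed-length read, so the Hamming distance is the full barcode length, while the edit distance stays small. The fill-in base at the right end of the window is the kind of effect that a fixed-window, indel-aware distance has to account for.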
