Not All Sequence Tags Are Created Equal: Designing and Validating Sequence Identification Tags Robust to Indels

Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag sequences should be numerous and sufficiently different from one another to ensure that sequencing, replication, and oligonucleotide synthesis errors do not cause tags to be unrecoverable or confused. However, many design approaches protect only against substitution errors during sequencing, and extant tag sets contain too few tag sequences. We developed an open-source software package to validate sequence tags for conformance to two distance metrics and to design sequence tags robust to indel and substitution errors. We use this software package to evaluate several commercial and non-commercial sequence tag sets, to design several large sets (maxcount = 7,198) of edit metric sequence tags having different lengths and degrees of error correction, and to integrate a subset of these edit metric tags into polymerase chain reaction (PCR) primers and sequencing adapters. We validate a subset of these edit metric tagged PCR primers and sequencing adapters by sequencing on several platforms and comparing the results to commercially available alternatives. We find that several commonly used sets of sequence tags, or the design methodologies used to produce them, do not meet the minimum expectations of their underlying distance metric, and we find that PCR primers and sequencing adapters incorporating edit metric sequence tags designed by our software package perform as well as their commercial counterparts. We suggest that researchers evaluate sequence tags before use and re-evaluate tags that they are already using. The sequence tag sets we design improve on extant sets because they are large, valid across the set, and robust to the suite of substitution, insertion, and deletion errors affecting massively parallel sequencing workflows on all currently used platforms.
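To illustrate what validating a tag set against a distance metric can look like, the following is a minimal Python sketch, not the software package described above. It assumes the two metrics are Hamming distance (substitutions only) and plain Levenshtein edit distance (substitutions, insertions, and deletions); the function names, the example tags, and the threshold of 3 are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length tags")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn tag a into tag b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def violations(tags, metric, min_distance):
    """Return every pair of tags closer than min_distance under metric."""
    found = []
    for a, b in combinations(tags, 2):
        d = metric(a, b)
        if d < min_distance:
            found.append((a, b, d))
    return found


# Hypothetical two-tag set: every position differs, so the pair looks
# well separated under Hamming distance, yet one deletion plus one
# insertion converts one tag into the other.
example_tags = ["ACGT", "CGTA"]
hamming_problems = violations(example_tags, hamming, 3)
edit_problems = violations(example_tags, levenshtein, 3)
print(hamming_problems)
print(edit_problems)
```

In this hypothetical set the pair passes a minimum-distance check of 3 under Hamming distance but fails the same check under edit distance, which is the kind of gap that substitution-only design approaches leave open to indel errors.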
