Short Barcodes for Next Generation Sequencing

We consider the design and evaluation of short barcodes, six to eight nucleotides long, used for parallel sequencing on platforms where substitution errors dominate. Such codes should not only have good error correction properties; their code words should also fulfil certain biological constraints (experimental parameters). We compare published barcodes with codes obtained by two new construction methods: one based on the currently best known linear codes, and one that is a simple randomized construction. The evaluation is done with respect to error correction capability, barcode size, experimental parameters, and fundamental bounds on the code size and distance properties. We provide a list of codes for lengths between six and eight nucleotides; for length eight, two substitution errors can be corrected. In fact, no code with larger minimum distance can exist.
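To make the idea of a randomized construction concrete, the following is a minimal sketch and not the authors' method: it shuffles all words of length n over {A, C, G, T}, filters them by two illustrative biological constraints, and greedily keeps every word whose Hamming distance to all previously kept words is at least d. The function names, the GC-content window of 40 to 60 percent, and the homopolymer limit of 2 are assumptions chosen for illustration; the abstract does not specify the experimental parameters. A code with minimum Hamming distance d corrects up to floor((d - 1) / 2) substitution errors, so d = 5 corresponds to the two-error correction mentioned for length eight.

```python
import itertools
import random

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def gc_ok(word, lo=0.4, hi=0.6):
    """Assumed constraint: GC fraction must lie in [lo, hi]."""
    gc = sum(c in "GC" for c in word) / len(word)
    return lo <= gc <= hi


def max_homopolymer(word):
    """Length of the longest run of identical nucleotides."""
    longest = run = 1
    for prev, cur in zip(word, word[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def random_greedy_code(n, d, seed=0, max_run=2):
    """Greedy randomized construction (illustrative sketch).

    Visits the admissible words in random order and keeps each word
    that is at distance >= d from every word kept so far.
    """
    rng = random.Random(seed)
    candidates = ["".join(w) for w in itertools.product(ALPHABET, repeat=n)]
    candidates = [w for w in candidates
                  if gc_ok(w) and max_homopolymer(w) <= max_run]
    rng.shuffle(candidates)
    code = []
    for w in candidates:
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
    return code


def min_distance(code):
    """Minimum pairwise Hamming distance of a code with >= 2 words."""
    return min(hamming(a, b) for a, b in itertools.combinations(code, 2))


if __name__ == "__main__":
    code = random_greedy_code(n=6, d=3, seed=1)
    d_min = min_distance(code)
    print(len(code), "barcodes, minimum distance", d_min,
          "corrects", (d_min - 1) // 2, "substitution error(s)")
```

Different seeds give codes of different sizes, so in practice one would run the search several times and keep the largest result. The greedy result is only a lower bound on the achievable code size.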
