HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

Significance

This paper constructs an error-correcting code for the {A,C,G,T} alphabet of DNA. In contrast to previous work, the code corrects insertions and deletions directly, in a single strand of DNA, without the need for multiple alignment of strands. When coupled to a standard outer code, it can achieve error-free storage of petabyte-scale data even when ∼10% of all nucleotides are erroneous.

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code, which repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization so that they can be corrected by a standard Reed–Solomon outer code interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats or windowed guanine–cytosine (GC) content that is too high or too low. We test our code both in silico, via simulations, and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded by error rates as high as 10%. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
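To make the kind of sequence constraint mentioned above concrete, the following minimal Python sketch checks a candidate strand against a maximum homopolymer run length and a windowed GC-content range. It is not the HEDGES encoder or decoder; the function names, the window size, and the thresholds (`max_run`, `window`, `gc_min`, `gc_max`) are illustrative assumptions rather than values taken from the paper.

```python
from typing import List


def max_run_length(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def windowed_gc_fractions(seq: str, window: int) -> List[float]:
    """GC fraction of every length-`window` substring, via a sliding count."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(seq) < window:
        return []
    gc = sum(base in "GC" for base in seq[:window])
    fractions = [gc / window]
    for i in range(window, len(seq)):
        # Slide the window one base to the right.
        gc += (seq[i] in "GC") - (seq[i - window] in "GC")
        fractions.append(gc / window)
    return fractions


def satisfies_constraints(
    seq: str,
    max_run: int = 3,      # assumed limit on homopolymer length
    window: int = 12,      # assumed GC window size
    gc_min: float = 0.35,  # assumed lower GC bound
    gc_max: float = 0.65,  # assumed upper GC bound
) -> bool:
    """True if the strand meets both illustrative constraints."""
    if max_run_length(seq) > max_run:
        return False
    return all(gc_min <= f <= gc_max
               for f in windowed_gc_fractions(seq, window))
```

For example, `satisfies_constraints("ACGT" * 4)` accepts a balanced strand, while a strand beginning with `AAAAA` is rejected by the run-length check. A check of this kind only validates a finished strand; a code that incorporates constraints, as the abstract describes for HEDGES, would have to respect them while the strand is being generated.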
