Capacity-Approaching Constrained Codes With Error Correction for DNA-Based Data Storage

We propose coding techniques that limit the length of homopolymer runs, satisfy the GC-content constraint, and correct a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, \epsilon > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the longest homopolymer run in each codeword has length at most $\ell$; (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-\epsilon, 0.5+\epsilon]$; (iii) Error correction: the code can correct a single deletion, a single insertion, or a single substitution error in each codeword. For practical values of $\ell$ and $\epsilon$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.
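To make properties (i) and (ii) concrete, the following is a minimal Python sketch of a checker for a candidate strand. It is an illustration only, not the encoder or decoder proposed in the paper; the function names (`max_homopolymer_run`, `gc_content`, `satisfies_constraints`) and the parameter names `ell` and `eps` are hypothetical and chosen to mirror $\ell$ and $\epsilon$.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Return the length of the longest run of identical consecutive symbols."""
    # groupby splits the strand into maximal runs of equal symbols.
    return max((sum(1 for _ in run) for _, run in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Return the fraction of symbols in the strand that are G or C."""
    if not strand:
        raise ValueError("GC-content is undefined for an empty strand")
    return sum(base in "GC" for base in strand) / len(strand)


def satisfies_constraints(strand: str, ell: int, eps: float) -> bool:
    """Check the runlength constraint (i) and the GC-content constraint (ii)."""
    run_ok = max_homopolymer_run(strand) <= ell
    gc_ok = abs(gc_content(strand) - 0.5) <= eps
    return run_ok and gc_ok


if __name__ == "__main__":
    # Illustrative parameters (assumed, not taken from the paper).
    for strand in ("ACGTACGT", "AAAATTGC", "AATTAATT"):
        print(strand, satisfies_constraints(strand, ell=3, eps=0.1))
```

With the assumed values `ell=3` and `eps=0.1`, the strand `ACGTACGT` passes both checks, `AAAATTGC` fails both (a run of four A's and a GC-content of 0.25), and `AATTAATT` fails only the GC-content check. Property (iii) concerns the code as a whole rather than a single strand, so it is not covered by this per-strand check.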
