Constrained Coding with Error Control for DNA-Based Data Storage

In this paper, we first propose coding techniques for DNA-based data storage that account for the maximum homopolymer runlength and the GC-content. In particular, for arbitrary ℓ,ϵ>0, we propose simple and efficient (ℓ,ϵ)-constrained encoders that transform binary sequences into DNA base sequences (codewords) satisfying the following properties:
• Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ.
• GC-content constraint: the GC-content of each codeword is within [0.5−ϵ, 0.5+ϵ].
For practical values of ℓ and ϵ, our codes achieve higher rates than the existing results in the literature. We further design efficient (ℓ,ϵ)-constrained codes with error-correction capability. Specifically, the designed codes satisfy the runlength constraint and the GC-content constraint, and can correct a single edit (i.e., a single deletion, insertion, or substitution) and its variants. To the best of our knowledge, no such codes had been constructed prior to this work.
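The two constraints can be made concrete with a small checker. The sketch below is illustrative only and is not the encoder proposed in the paper: it merely tests whether a given DNA string satisfies the runlength and GC-content constraints as defined above. The function names (`max_homopolymer_run`, `gc_content`, `is_constrained`) are hypothetical and chosen for this example.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases (0 if empty)."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("GC-content is undefined for an empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def is_constrained(seq: str, ell: int, eps: float) -> bool:
    """Check the (ell, eps) constraints: max run <= ell and
    GC-content within [0.5 - eps, 0.5 + eps]."""
    return max_homopolymer_run(seq) <= ell and abs(gc_content(seq) - 0.5) <= eps


if __name__ == "__main__":
    for word in ("AACCGGTT", "AAACGT", "ATAT"):
        print(word, max_homopolymer_run(word), gc_content(word),
              is_constrained(word, ell=2, eps=0.1))
```

For instance, with ℓ = 2 and ϵ = 0.1, the word AACCGGTT passes both checks, AAACGT fails the runlength constraint (it contains a run of three A's), and ATAT fails the GC-content constraint (it contains no G or C).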
