Coding for Efficient DNA Synthesis

For DNA data storage to become a feasible technology, all aspects of the encoding and decoding pipeline must be optimized. Writing the data into DNA, known as DNA synthesis, is currently the most costly part of existing storage systems. As a step toward more efficient synthesis, we study the design of codes that minimize the time and the amount of material needed to produce the DNA strands. We consider a popular synthesis process that builds many strands in parallel, step by step, using a fixed supersequence S. The machine iterates through S one nucleotide at a time, and in each cycle it appends the current nucleotide to a subset of the strands. The synthesis time is therefore determined by the length of S. We show that by introducing redundancy into the synthesized strands, we can significantly decrease the number of synthesis cycles. We derive the maximum amount of information per synthesis cycle, assuming S is an arbitrary periodic sequence. To prove our results, we exhibit new connections to cost-constrained codes.
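
To make the cycle model concrete, the following minimal sketch counts how many cycles a machine needs when S is a periodic sequence. It is an illustration under stated assumptions, not the construction from the paper: the function names `synthesis_cycles` and `batch_cycles` are hypothetical, and the period ACGT is just one example of a periodic supersequence. A strand can be synthesized within a prefix of S exactly when it is a subsequence of that prefix, so a greedy left-to-right match gives the shortest such prefix.

```python
def synthesis_cycles(strand: str, period: str = "ACGT") -> int:
    """Number of cycles needed to build `strand` when the machine
    repeats `period` forever (hypothetical helper for illustration).

    In each cycle the machine offers one nucleotide; the strand grows
    only if that nucleotide is the next one it needs.
    """
    if not period:
        raise ValueError("period must be non-empty")
    missing = set(strand) - set(period)
    if missing:
        # Without this check the loop below would never terminate.
        raise ValueError(f"symbols not in period: {sorted(missing)}")

    cycles = 0  # cycles used so far
    pos = 0     # index of the next nucleotide the strand needs
    while pos < len(strand):
        offered = period[cycles % len(period)]
        cycles += 1
        if offered == strand[pos]:
            pos += 1
    return cycles


def batch_cycles(strands: list[str], period: str = "ACGT") -> int:
    """Strands are built in parallel, so the batch finishes when the
    slowest strand finishes."""
    return max((synthesis_cycles(s, period) for s in strands), default=0)


if __name__ == "__main__":
    for s in ["ACGT", "AAAA", "TGCA"]:
        print(s, synthesis_cycles(s))
```

Under this model, strands of the same length can differ widely in cost: a strand that follows the order of the period finishes in few cycles, while one that runs against it needs almost a full period per nucleotide. Restricting the code to strands with low cycle counts is the kind of trade-off between redundancy and synthesis time that the abstract describes.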
