Batch Optimization for DNA Synthesis

Large pools of synthetic DNA molecules have recently been used to reliably store significant volumes of digital data. While DNA has enormous potential as a storage medium because of its high storage density, its practical use is currently severely limited by the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cost of large-scale DNA synthesis, which translates to the following algorithmic task: given a large pool $\mathcal{S}$ of random quaternary strings of fixed length, partition $\mathcal{S}$ into batches in a way that minimizes the sum of the lengths of the shortest common supersequences across batches. We introduce two ideas for batch optimization, each of which improves (in a different way) upon a naive baseline: (1) using both $(ACGT)^{*}$ and its reverse $(TGCA)^{*}$ as reference strands, and batching appropriately, and (2) batching via the quantiles of an appropriate ordering of the strands. We also prove asymptotically matching lower bounds on the cost of DNA synthesis, showing that one cannot improve upon these two ideas. Our results uncover a surprising separation between two cases that arise naturally in the context of DNA data storage: the asymptotic cost savings of batch optimization are significantly greater when the strings in $\mathcal{S}$ contain no repeats of the same character (homopolymers) than when the strings in $\mathcal{S}$ are unconstrained.
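To make the two batching ideas concrete, the following Python sketch may help. It rests on assumptions that are not stated in the abstract: the cost of a batch is taken to be the length of the shortest prefix of a periodic reference strand that contains every strand in the batch as a subsequence (an upper bound on the shortest common supersequence length, not its exact value), the "appropriate ordering" is taken to be the ordering by this per-strand cost, and the naive baseline is taken to be an arbitrary split into equal-size batches. All function names are hypothetical.

```python
import random

def cycles_needed(strand, period="ACGT"):
    """Length of the shortest prefix of period* that contains strand
    as a subsequence (assumed per-strand synthesis cost)."""
    pos = 0  # number of reference characters consumed so far
    for ch in strand:
        # Skip ahead to the next occurrence of ch in the periodic reference.
        offset = (period.index(ch) - pos) % len(period)
        pos += offset + 1
    return pos

def batch_cost(batch, period="ACGT"):
    """A batch shares one reference prefix, so its cost is the worst strand's."""
    return max(cycles_needed(s, period) for s in batch)

def split(items, k):
    """Split items into k consecutive equal-size groups (assumes k divides len)."""
    size = len(items) // k
    return [items[i * size:(i + 1) * size] for i in range(k)]

def arbitrary_cost(strands, k):
    """Assumed naive baseline: batches taken in the given (arbitrary) order."""
    return sum(batch_cost(b) for b in split(strands, k))

def quantile_cost(strands, k):
    """Idea (2), as assumed here: order strands by cost, then cut at quantiles."""
    ordered = sorted(strands, key=cycles_needed)
    return sum(batch_cost(b) for b in split(ordered, k))

def two_reference_cost(strands):
    """Idea (1), as assumed here: send each strand to whichever of
    (ACGT)* or (TGCA)* synthesizes it in fewer cycles."""
    fwd, rev = [], []
    for s in strands:
        if cycles_needed(s, "ACGT") <= cycles_needed(s, "TGCA"):
            fwd.append(s)
        else:
            rev.append(s)
    total = 0
    if fwd:
        total += batch_cost(fwd, "ACGT")
    if rev:
        total += batch_cost(rev, "TGCA")
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["".join(rng.choice("ACGT") for _ in range(20)) for _ in range(40)]
    print("arbitrary batches:", arbitrary_cost(pool, 4))
    print("quantile batches: ", quantile_cost(pool, 4))
    print("two references:   ", two_reference_cost(pool))
```

For equal-size batches, cutting the cost-sorted pool at quantiles never costs more than an arbitrary split under this model, since the $j$-th cheapest batch of any partition must contain a strand at least as costly as the corresponding quantile. The sketch says nothing about the lower bounds or the exact asymptotic savings.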
