Batch Optimization for DNA Synthesis

Large pools of synthetic DNA molecules have recently been used to reliably store significant volumes of digital data. While DNA has enormous potential as a storage medium because of its high storage density, its practical use is currently severely limited by the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cost of large-scale DNA synthesis, which translates to the following algorithmic task: given a large pool $S$ of random quaternary strings of fixed length, partition $S$ into batches in a way that minimizes the sum of the lengths of the shortest common supersequences across batches. We introduce two ideas for batch optimization that both improve (in different ways) upon a naive baseline: (1) using both (ACGT)* and its reverse (TGCA)* as reference strands, and batching appropriately, and (2) batching via the quantiles of an appropriate ordering of the strands. We also prove asymptotically matching lower bounds on the cost of DNA synthesis, showing that one cannot improve upon these two ideas. Our results uncover a surprising separation between two cases that naturally arise in the context of DNA data storage: the asymptotic cost savings of batch optimization are significantly greater when strings in $S$ do not contain repeats of the same character (homopolymers) than when strings in $S$ are unconstrained. A full version of this paper is accessible at: https://arxiv.org/abs/2011.14532
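The two batching ideas can be illustrated with a small Python sketch. It rests on assumptions that are not stated in the abstract: the cost of a batch is taken to be the length of the shortest prefix of a periodic reference strand that contains every strand in the batch as a subsequence (an upper bound on the shortest common supersequence, not its exact length), the ordering used for quantiles is simply this per-strand cost, and the two-reference rule assigns each strand to whichever reference is cheaper. The function names are hypothetical, and none of this is claimed to be the authors' exact procedure.

```python
from typing import List, Sequence


def synthesis_cost(strand: str, order: str = "ACGT") -> int:
    """Length of the shortest prefix of the periodic reference (order)*
    that contains `strand` as a subsequence (greedy left-to-right embedding)."""
    index = {base: i for i, base in enumerate(order)}
    period = len(order)
    pos = 0  # number of reference characters consumed so far
    for base in strand:
        # Skip ahead to the next occurrence of `base` in the periodic reference.
        pos += (index[base] - pos) % period + 1
    return pos


def batch_cost(batch: Sequence[str], order: str = "ACGT") -> int:
    """Cost of one batch under a fixed reference: the longest prefix any strand needs."""
    return max((synthesis_cost(s, order) for s in batch), default=0)


def total_cost(batches: Sequence[Sequence[str]], order: str = "ACGT") -> int:
    """Sum of batch costs, all batches using the same reference."""
    return sum(batch_cost(b, order) for b in batches)


def quantile_batches(strands: Sequence[str], k: int, order: str = "ACGT") -> List[List[str]]:
    """Assumed quantile rule: sort by per-strand cost, cut into k contiguous groups."""
    ordered = sorted(strands, key=lambda s: synthesis_cost(s, order))
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    batches = [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]
    return [b for b in batches if b]  # drop empty groups when k > n


def two_reference_batches(strands: Sequence[str]):
    """Assumed two-reference rule: each strand goes to the cheaper of (ACGT)* and (TGCA)*."""
    forward, backward = [], []
    for s in strands:
        if synthesis_cost(s, "ACGT") <= synthesis_cost(s, "TGCA"):
            forward.append(s)
        else:
            backward.append(s)
    return forward, backward


if __name__ == "__main__":
    pool = ["ACGT", "TGCA", "AAAA", "ACAC"]
    arbitrary = [pool[:2], pool[2:]]
    print("arbitrary split:", total_cost(arbitrary))
    print("quantile split: ", total_cost(quantile_batches(pool, 2)))
    fwd, bwd = two_reference_batches(pool)
    print("two references: ", batch_cost(fwd, "ACGT") + batch_cost(bwd, "TGCA"))
```

Comparisons between strategies are only meaningful at a fixed number of batches, since under this cost model splitting a pool further can only add terms to the sum.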

[1]  Miklós Z. Rácz,et al.  Batch Optimization for DNA Synthesis , 2022, IEEE Transactions on Information Theory.
