DNA Fountain enables a robust and efficient storage architecture

A reliable and efficient DNA storage architecture

DNA has the potential to provide large-capacity information storage. However, current methods have only been able to use a fraction of the theoretical maximum. Erlich and Zielinski present a method, DNA Fountain, which approaches the theoretical maximum for information stored per nucleotide. They demonstrated efficient encoding of information, including a full computer operating system, into DNA that could be retrieved at scale after multiple rounds of polymerase chain reaction.

Science, this issue p. 950

A resilient DNA storage strategy enables near-maximal information content per nucleotide.

DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10^6 bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10^15 retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
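
To give a feel for what "information per nucleotide" means, the sketch below packs binary data into DNA at the naive ceiling of 2 bits per nucleotide (four bases, so log2(4) = 2 bits each). It is a minimal illustration, not the DNA Fountain encoder: the particular bit-to-base assignment and the helper names `bytes_to_dna`, `dna_to_bytes`, and `min_nucleotides` are assumptions made for this example, and it ignores indexing, error correction, and biochemical sequence constraints, all of which push the achievable figure below 2 bits per nucleotide.

```python
# Hypothetical 2-bit-per-nucleotide mapping; the assignment of bit pairs
# to bases is an assumption chosen for illustration only.
BITS_TO_NT = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BITS_TO_NT[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of four."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for nt in seq[i:i + 4]:
            byte = (byte << 2) | NT_TO_BITS[nt]
        out.append(byte)
    return bytes(out)


def min_nucleotides(n_bytes: int) -> int:
    """Lower bound on payload nucleotides at exactly 2 bits per nucleotide."""
    return n_bytes * 8 // 2


if __name__ == "__main__":
    payload = b"DNA"
    strand = bytes_to_dna(payload)
    assert dna_to_bytes(strand) == payload
    print(strand)
    # Payload-only lower bound for the 2.14 x 10^6 bytes reported above.
    print(min_nucleotides(2_140_000))
```

At this idealized rate, the 2.14 × 10^6 bytes reported in the abstract would need at least 8.56 × 10^6 payload nucleotides; any real scheme spends additional nucleotides on addressing and redundancy.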
