Toward practical high-capacity low-maintenance storage of digital information in synthesised DNA

The shift to digital systems for the creation, transmission and storage of information has led to increasing complexity in archiving, requiring active, ongoing maintenance of the digital media. DNA is an attractive target for information storage1 because of its capacity for high-density information encoding, longevity under easily-achieved conditions2–4 and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information5–7 or were not amenable to scaling-up8; they also used no robust error-correction and did not examine their cost-efficiency for large-scale information archival9. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kB of hard disk storage and with an estimated Shannon information10 of 5.2 × 10^6 bits into a DNA code, synthesised this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-storage scheme scales far beyond current global information volumes. These results demonstrate DNA-storage to be a realistic technology for large-scale digital archiving that may already be cost-effective for low access, multi-century-long archiving tasks. Within a decade, as costs fall rapidly under realistic scenarios for technological advances, it may be cost-effective for sub-50-year archival.

Although techniques for manipulating, storing and copying large amounts of DNA have been established for many years11–13, these rely on the availability of initial copies of the DNA molecule to be processed, and one of the main challenges for practical information storage in DNA is the difficulty of synthesising long sequences of DNA de novo to an exactly-specified design. Instead, we developed an in vitro approach that represents the information being stored as a hypothetical long DNA molecule and encodes this using shorter DNA fragments.
A similar approach was proposed by Church et al.9 in a report submitted and published while this manuscript was in review. Isolated DNA fragments are easily manipulated in vitro11,13, and the routine recovery of intact fragments from samples that are tens of thousands of years old14,15 indicates that a well-prepared synthetic DNA sample should have an exceptionally long lifespan in low-maintenance environments3,4.

*To whom correspondence should be addressed; goldman@ebi.ac.uk.
Supplementary Information is provided as a number of separate files accompanying this document.
Author Contributions: N.G. and E.B. conceived and planned the project and devised the information encoding methods. P.B. advised on NGS protocols, prepared the DNA library and managed the sequencing process. S.C. and E.M.L. provided custom oligonucleotides. N.G. wrote the software for encoding and decoding information into/from DNA and analysed the data. N.G., E.B., C.D. and B.S. modelled the scaling properties of DNA-storage. N.G. wrote the paper with discussions and contributions from all other authors. N.G. and C.D. produced the figures.
Author Information: Data are available online at http://www.ebi.ac.uk/goldman-srv/DNA-storage and in the Sequence Read Archive (SRA) with accession number ERP002040 (to be confirmed). Correspondence and requests for materials should be addressed to N.G. (goldman@ebi.ac.uk).
Competing Financial Interests: The authors declare competing financial interests: details have been uploaded via Nature’s online manuscript tracking system.
Europe PMC Funders Group Author Manuscript. Nature. Author manuscript; available in PMC 2013 August 07. Published in final edited form as: Nature. 2013 February 7; 494(7435): 77–80. doi:10.1038/nature11875.

In contrast, systems based on living vectors6–8 would not be reliable, scalable or cost-efficient, having disadvantages including constraints on the genomic elements and locations that can be manipulated without affecting viability, the fact that mutation will cause the fidelity of stored and decoded information to reduce over time and possibly the requirement for storage conditions to be carefully regulated. Existing schemes in the field of DNA computing in principle permit large-scale memory1,16, but data encoding in DNA computing is inextricably linked to the specific application or algorithm17 and no practical schemes have been realised.

We selected computer files to be encoded as a proof of concept for practical DNA-storage, choosing a range of common formats to emphasise the ability to store arbitrary digital information. The five files comprised all 154 of Shakespeare’s sonnets (ASCII text), a classic scientific paper18 (PDF format), a medium-resolution colour photograph of the European Bioinformatics Institute (JPEG 2000 format), a 26 s excerpt from Martin Luther King’s 1963 “I Have A Dream” speech (MP3 format) and a Huffman code10 used in this study to convert bytes to base-3 digits (ASCII text), giving a total of 757,051 bytes (Shannon information10 5.2 × 10^6 bits). Full details are given in Supplementary Information and Supplementary Table S1.

The bytes comprising each file were represented as single DNA sequences with no homopolymers (runs of ≥ 2 identical bases, which are associated with higher error rates in existing high-throughput sequencing technologies19 and led to errors in Church et al.’s experiment9). Each DNA sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement (see Fig. 1 and Supplementary Information). These measures reduce the probability of systematic failure for any particular string, which could lead to uncorrectable errors and data loss.
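The homopolymer-free representation and the overlapping, alternately reverse-complemented segments can be illustrated with a minimal Python sketch. It is not the study's software: the next-base rule (choose among the three bases that differ from the previous one), the starting base, and the segment length of 100 nt with a 25 nt step are assumptions chosen only to be consistent with the description above, and the function names are hypothetical.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trits_to_dna(trits, prev="A"):
    """Map base-3 digits to DNA so that no base repeats its predecessor.

    Assumed rule: each trit (0, 1 or 2) selects one of the three bases
    that differ from the previous base, taken in alphabetical order.
    """
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna under the same assumed rule."""
    trits = []
    for b in dna:
        choices = [c for c in BASES if c != prev]
        trits.append(choices.index(b))
        prev = b
    return trits


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def segments(dna, length=100, step=25):
    """Split dna into overlapping windows; every second window is
    reverse-complemented. With length = 4 * step (assumed values),
    interior bases are covered by four windows."""
    segs = []
    starts = range(0, len(dna) - length + 1, step)
    for i, start in enumerate(starts):
        seg = dna[start:start + length]
        if i % 2 == 1:
            seg = reverse_complement(seg)
        segs.append(seg)
    return segs
```

Because each base is chosen from the three that differ from its predecessor, runs of identical bases cannot occur, and taking the reverse complement of a segment preserves that property.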
Each segment was then augmented with indexing information that permitted determination of the file from which it originated and its location within that file, and simple parity-check error-detection10. In all, the five files were represented by a total of 153,335 strings of DNA, each comprising 117 nt. An additional advantage of our encoding scheme is that the perfectly uniform fragment lengths and absence of homopolymers make it obvious that the synthesised DNA does not have a natural (biological) origin, and so imply the presence of deliberate design and encoded information2.

Oligonucleotides (oligos) corresponding to our designed DNA strings were synthesised using an updated version of Agilent Technologies’ OLS (oligo library synthesis) process20. This created a large number (~2.5 × 10^6) of copies of each DNA string, with errors occurring only rarely (~1 error per 500 bases) and independently in the different copies of each string, again enhancing our method’s error tolerance. The synthesised DNA was supplied lyophilised, a form expected to have excellent long-term preservation characteristics3,4, and was shipped (at ambient temperature, without specialised packaging) from the USA to Germany via the UK. After resuspension, amplification and purification, a sample of the resulting library products was sequenced in paired-end mode on the Illumina HiSeq 2000. The remainder of the library was transferred to multiple aliquots and relyophilised for long-term storage.

Base calling using AYB21 yielded 79.6M read-pairs of 104 bases in length. Full-length (117 nt) DNA strings were reconstructed in silico from the read-pairs, with those containing uncertainties due to synthesis or sequencing errors being discarded. The remaining strings
were then decoded using the reverse of the encoding procedure, with the error-detection bases and properties of the coding scheme allowing us to discard further strings containing errors. While many discarded strings will have contained information that could be recovered with more sophisticated decoding, the high level of redundancy and sequencing coverage rendered this unnecessary in our experiment. Full-length DNA sequences representing the original encoded files were then reconstructed in silico. The decoding process used no additional information derived from knowledge of the experimental design. Full details of the encoding, sequencing and decoding processes are given in Supplementary Information.

Four of the five resulting DNA sequences could be fully decoded without intervention. The fifth, however, contained two gaps: runs of 25 bases each for which no segment was detected corresponding to the original DNA. Each of these gaps was caused by the failure to sequence any oligo representing any of four consecutive overlapping segments. Inspection of the neighbouring regions of the reconstructed sequence permitted us to hypothesise what the missing nucleotides should have been (see Supplementary Information) and we manually inserted those 50 bases accordingly. This sequence could also then be decoded. Inspection confirmed that our original computer files had been reconstructed with 100% accuracy.

To investigate its suitability for long-term digital archiving, we studied how DNA-storage scales to larger applications. The number of bases of synthesised DNA needed to encode information grows linearly with the amount of information to be stored, but we must also consider the indexing information required to reconstruct full-length files from short fragments.
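The cost of indexing can be made concrete with a small worked calculation in Python. This is a deliberately simplified model, not the layout used in the study: it assumes each fragment carries a single base-3 index written at one nucleotide per digit, ignores file identifiers and parity digits, and uses a fixed per-fragment budget (`budget_nt`, set to 114 nt here purely for illustration). The function names are hypothetical.

```python
def index_trits(n_fragments):
    """Smallest number of base-3 digits that gives every one of
    n_fragments fragments a distinct index."""
    k = 1
    while 3 ** k < n_fragments:
        k += 1
    return k


def payload_fraction(n_fragments, budget_nt=114):
    """Fraction of a fixed per-fragment budget left for data once the
    index has been written (assumed: one nucleotide per index digit)."""
    return (budget_nt - index_trits(n_fragments)) / budget_nt
```

Under this simplified model, distinguishing the 153,335 strings reported above needs 11 index digits, and every threefold increase in the number of fragments costs one further digit, so the share of each fragment available for data shrinks only slowly as the archive grows.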
As indexing information grows only as the logarithm of the number of fragments to be indexed, the total amount of synthesised DNA required grows sub-linearly. However, increasingly large parts of each fragment are needed for indexing and, although it is reasonable to expect synthesis of longer strings to be possible in future, we modelled the behaviour of our scheme under the conservative constraint of a constant 114 nt available for both data and indexing information (see

[1] Lila Kari, et al. DNA computing: a research snapshot, 2010.

[2] Thomas H Segall-Shapiro, et al. Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome, 2010, Science.

[3] Emily M. LeProust, et al. Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process, 2010, Nucleic acids research.

[4] J P Cox, et al. Long-term data storage in DNA, 2001, Trends in biotechnology.

[5] Gheorghe Paun, et al. DNA Computing: New Computing Paradigms, 1998.

[6] Peter A Carr, et al. Genome engineering, 2009, Nature Biotechnology.

[7] Marion Boyer, et al. The Clock of the Long Now, 2009.

[8] Tim Massingham, et al. All Your Base: a fast and accurate probabilistic approach to base calling, 2012, Genome Biology.

[9] Steve Murray, et al. Increasing the efficiency of tape-based storage backends, 2010.

[10] Philip L. F. Johnson, et al. A Draft Sequence of the Neandertal Genome, 2010, Science.

[11] Mary Baker, et al. A fresh look at the reliability of long-term digital storage, 2005, EuroSys.

[12] Martin Yuille, et al. The UK DNA banking network: a “fair access” biobank, 2009, Cell and Tissue Banking.

[13] G. Church, et al. Next-Generation Digital Information Storage in DNA, 2012, Science.

[14] Menachem Ailenberg, et al. An improved Huffman coding method for archiving text, images, and music characters in DNA, 2009, BioTechniques.

[15] Catherine Taylor Clelland, et al. Hiding messages in DNA microdots, 1999, Nature.

[16] A. Monaco, et al. YACs, BACs, PACs and MACs: artificial chromosomes as research tools, 1994, Trends in biotechnology.

[17] David J. C. MacKay, et al. Information Theory, Inference, and Learning Algorithms, 2004, IEEE Transactions on Information Theory.

[18] F. Crick, et al. Molecular structure of nucleic acids, 2004, JAMA.

[19] E B Baum, et al. Building an associative memory vastly larger than the brain, 1995, Science.

[20] James Haile, et al. Ancient Biomolecules from Deep Ice Cores Reveal a Forested Southern Greenland, 2007, Science.

[21] T. Anchordoquy, et al. Preservation of DNA, 2007.

[22] Matthew B. Kerby, et al. Landscape of next-generation sequencing technologies, 2011, Analytical chemistry.

[23] J. Sninsky, et al. Recent advances in the polymerase chain reaction, 1991, Science.

[24] J. Bonnet, et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage, 2009, Nucleic acids research.