BioCode: Two biologically compatible algorithms for embedding data in non-coding and coding regions of DNA

Background: In recent times, the application of deoxyribonucleic acid (DNA) has diversified with the emergence of fields such as DNA computing and DNA data embedding. DNA data embedding, also known as DNA watermarking or DNA steganography, aims to develop robust algorithms for encoding non-genetic information in DNA. DNA is inherently a digital medium in which the nucleotide bases act as digital symbols, a fact that underpins all bioinformatics techniques and that makes simple information encoding in DNA straightforward. The situation is more complex, however, for methods that aim to embed information in the genomes of living organisms. DNA is susceptible to mutations, which act as a noisy channel from the point of view of information encoded in DNA. The DNA data embedding field is therefore closely related to digital communications. It is also a distinctive area of digital communications, because all methods must observe important biological constraints. Many DNA data embedding algorithms have been presented to date, all of which operate in one of two regions: non-coding DNA (ncDNA) or protein-coding DNA (pcDNA).

Results: This paper proposes two novel DNA data embedding algorithms, jointly called BioCode, which operate in ncDNA and pcDNA, respectively, and which comply fully with stricter biological restrictions. Existing methods comply with some elementary biological constraints, such as preserving protein translation in pcDNA. However, there are further biological restrictions that no DNA data embedding method to date accounts for. Observing these constraints is key to increasing the biocompatibility and, in turn, the robustness of information encoded in DNA.

Conclusion: The algorithms encode information in near-optimal ways from a coding point of view, as we demonstrate by means of theoretical and empirical (in silico) analyses. They are also shown to encode information robustly, such that mutations have isolated effects. Furthermore, the preservation of codon statistics, while achieving a near-optimum embedding rate, implies that BioCode pcDNA is also a near-optimum first-order steganographic method.
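
The abstract's remark that nucleotide bases act as digital symbols can be illustrated with a naive encoding at two bits per base. The following is a minimal sketch under an assumed mapping (A=00, C=01, G=10, T=11) with hypothetical function names; it is not the BioCode algorithm and observes none of the biological constraints discussed above.

```python
# Naive 2-bit-per-base encoding, for illustration only.
# The mapping below is an assumption; it is NOT the BioCode algorithm
# and ignores every biological constraint (codon usage, translation, etc.).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Invert encode(); the sequence length must be a multiple of four."""
    if len(dna) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4 bases")
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    sequence = encode(b"Hi")
    print(sequence)          # CAGACGGC
    print(decode(sequence))  # b'Hi'
```

A mapping like this is what makes in vitro encoding straightforward. Embedding in a living genome is harder because an arbitrary base sequence may, for example, alter protein translation in pcDNA.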
