Repetition Coding as an Effective Error Correction Code for Information Encoded in DNA

The goal of DNA data embedding is to enable robust encoding of non-genetic information in DNA. This field straddles bioinformatics and digital communications, since, from the point of view of information encoding, DNA mutations act like a noisy channel. In this paper we present two algorithms which, building on a variant of a method proposed by Yachie et al., rely on repetition coding to counteract the impact that mutations have on an embedded message. The algorithms are designed for resynchronising multiple, originally identical, information-encoded DNA sequences embedded within non-coding DNA (ncDNA) sections of a host genome. They use both the BLAST and MUSCLE algorithms to accomplish this. Bit error rates at the decoder are established for mutation rates accumulated over a number of generations of the host organism. The empirical results obtained are compared to a theoretical bound for optimal decoding.
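The core idea of repetition coding with majority-vote decoding can be illustrated with a minimal sketch. It is not the paper's algorithms: it assumes the embedded copies have already been located and aligned (for instance with BLAST and MUSCLE) into equal-length strings that use `-` for gaps, and it uses a hypothetical two-bits-per-base mapping. The names `encode`, `majority_vote`, and `decode` are illustrative only.

```python
from collections import Counter
from typing import List

# Hypothetical 2-bits-per-base mapping (an assumption, not taken from the paper).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(bits: str) -> str:
    """Map a bit string of even length to a DNA string, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def majority_vote(aligned_copies: List[str]) -> str:
    """Return the column-wise consensus of pre-aligned copies.

    Gaps ('-') take part in the vote. A column whose winner is a gap is
    dropped, which discards insertions present in only a minority of copies.
    """
    if not aligned_copies:
        raise ValueError("at least one copy is required")
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("aligned copies must all have the same length")
    consensus = []
    for column in zip(*aligned_copies):
        winner, _count = Counter(column).most_common(1)[0]
        if winner != "-":
            consensus.append(winner)
    return "".join(consensus)


def decode(dna: str) -> str:
    """Map a DNA string back to bits using the hypothetical mapping."""
    return "".join(BASE_TO_BITS[base] for base in dna)


if __name__ == "__main__":
    message = "01101100"
    print(encode(message))
    # Three aligned copies: one intact, one with an insertion, one with a deletion.
    copies = ["CG-TA", "CGATA", "C--TA"]
    print(decode(majority_vote(copies)))
```

Using an odd number of copies avoids ties in the common case where only substitutions occur; how ties and heavily gapped columns should be resolved in practice is left open here.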

[1]  Félix Balado.  On the embedding capacity of DNA strands under substitution, insertion, and deletion mutations, 2010, Electronic Imaging.

[2]  Richard D. Wesel, et al.  A Tighter Bhattacharyya Bound for Decoding Error Probability, 2007, IEEE Communications Letters.

[3]  Catherine Taylor Clelland, et al.  Hiding messages in DNA microdots, 1999, Nature.

[4]  E. Myers, et al.  Basic local alignment search tool, 1990, Journal of Molecular Biology.

[5]  M. Tomita, et al.  Alignment-Based Approach for Durable Data Storage into Living Organisms, 2007, Biotechnology Progress.

[6]  M. Kimura.  A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences, 1980, Journal of Molecular Evolution.

[7]  Dominik Heider, et al.  DNA-based watermarks using the DNA-Crypt algorithm, 2007, BMC Bioinformatics.

[8]  Félix Balado, et al.  Capacity of DNA Data Embedding Under Substitution Mutations, 2011, IEEE Transactions on Information Theory.

[9]  Robert C. Edgar, et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput, 2004, Nucleic Acids Research.

[10]  M. Kreitman, et al.  Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria, 2009, Molecular Biology and Evolution.

[11]  Andy Purvis, et al.  Estimating the Transition/Transversion Ratio from Independent Pairwise Comparisons with an Assumed Phylogeny, 1997, Journal of Molecular Evolution.

[12]  Timothy B. Stockwell, et al.  Complete Chemical Synthesis, Assembly, and Cloning of a Mycoplasma genitalium Genome, 2008, Science.