Two proteins for the price of one: the design of maximally compressed coding sequences

The emerging field of synthetic biology moves beyond conventional genetic manipulation to construct novel life forms that do not originate in nature. We explore the problem of designing the provably shortest genomic sequence that encodes a given set of genes by exploiting alternate reading frames. We present an algorithm for designing the shortest DNA sequence that simultaneously encodes two given amino acid sequences. We show that the coding sequences of naturally occurring pairs of overlapping genes approach maximum compression. We also investigate the impact of alternate coding matrices on overlapping sequence design. Finally, we discuss an application of overlapping gene design: interleaving an antibiotic resistance gene into a target gene that is inserted into a virus or plasmid for amplification.
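To make the idea of encoding two amino acid sequences in one stretch of DNA concrete, the following is a minimal Python sketch, not the algorithm of the paper. It assumes the standard genetic code, considers only overlaps on the same strand, and uses hypothetical helper names (`translate`, `encode_pair`, `shortest_same_strand`). It scans nucleotide positions left to right and keeps one candidate sequence per pair of trailing nucleotides, since every later codon constraint depends only on those two bases.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternate coding matrix is used in this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, start=0):
    """Translate every complete codon of `dna`, reading from index `start`."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(start, len(dna) - 2, 3))


def encode_pair(prot_a, prot_b, offset):
    """Return DNA encoding prot_a from index 0 and prot_b from `offset`, or None.

    Hypothetical helper: same-strand overlaps only.
    """
    end_a = 3 * len(prot_a)
    end_b = offset + 3 * len(prot_b)
    # Map "last two nucleotides" -> one witness sequence ending in them.
    frontier = {"": ""}
    for p in range(max(end_a, end_b)):
        nxt = {}
        for seq in frontier.values():
            for base in BASES:
                cand = seq + base
                codon = cand[-3:]
                # A codon of prot_a is completed at position p.
                if p < end_a and p % 3 == 2:
                    if CODON_TABLE[codon] != prot_a[p // 3]:
                        continue
                # A codon of prot_b is completed at position p.
                if offset + 2 <= p < end_b and (p - offset) % 3 == 2:
                    if CODON_TABLE[codon] != prot_b[(p - offset) // 3]:
                        continue
                nxt.setdefault(cand[-2:], cand)
        if not nxt:
            return None
        frontier = nxt
    return next(iter(frontier.values()))


def shortest_same_strand(prot_a, prot_b):
    """Shortest same-strand DNA encoding both proteins, trying either order."""
    best = None
    for first, second in ((prot_a, prot_b), (prot_b, prot_a)):
        # Total length never decreases as the offset grows, so the first
        # feasible offset is the shortest design for this ordering.
        for offset in range(3 * len(first) + 1):
            dna = encode_pair(first, second, offset)
            if dna is not None:
                if best is None or len(dna) < len(best):
                    best = dna
                break
    return best
```

As a small illustration, the peptides `AAA` and `LL` can share nine nucleotides: `GCTGCTGCT` reads as Ala-Ala-Ala in frame 0 and as Leu-Leu in frame +1, so `shortest_same_strand("AAA", "LL")` returns a sequence of length 9. Reverse-strand overlaps, which a complete treatment would also need to consider, are deliberately left out of this sketch.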
