An error-correcting code framework for genetic sequence analysis

Abstract. A fundamental challenge in engineering communication systems is transmitting information from a source to a receiver over a noisy channel. The same problem exists in biological systems: how can the information required for the proper functioning of a cell, an organism, or a species be transmitted in an error-introducing environment? In engineering communication systems, source codes (compression codes) and channel codes (error-correcting codes) address this problem. Extending these information-theoretic concepts to the study of information transmission in biological systems can contribute to the general understanding of biological communication mechanisms and extend the field of coding theory into the biological domain. In this work, we review and compare existing coding-theoretic methods for modeling genetic systems. We introduce a new error-correcting code framework for understanding translation initiation at the cellular level and present research results for Escherichia coli K-12. By studying translation initiation, we hope to gain insight into potential error-correcting aspects of genomic sequences and systems.
