Analysis of coding theory based models for initiating protein translation in prokaryotic organisms

Rapid advances in both genomic data acquisition and computational technology have encouraged the development and use of engineering-based methods in bioinformatics and computational genomics. Several researchers have advocated the use of coding theory, specifically error-correction coding, in analyzing genetic data [1]. One goal of current work in this context is to use coding theory based analysis to determine whether regions of a given genome are protein-producing sequences. This work develops a coding theory view of the translation initiation process in prokaryotic organisms, drawing a parallel between the translation of messenger RNA into amino acid sequences and the decoding of noisy, convolutionally encoded parity streams. It presents a genetic algorithm-based method for designing optimal table-based convolutional coding models for translation initiation sites, using Escherichia coli K-12 as the model organism. Sequence-based and function-based convolutional coding models are constructed and applied to prokaryotic organisms of varying taxonomic relatedness: Escherichia coli K-12, Salmonella typhimurium LT2, Bacillus subtilis, and Staphylococcus aureus Mu50. Model analysis and results are presented.
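
To make the decoding analogy concrete, the following is a minimal sketch of what "decoding a noisy, convolutionally encoded stream" means in the textbook binary setting. It is not the table-based model or the genetic algorithm design developed in this work: the rate-1/2 code with generators 7 and 5 (octal), the binary alphabet, and the function names `encode` and `viterbi_decode` are all illustrative assumptions, and a model of translation initiation would operate on nucleotide sequences instead.

```python
# Illustrative only: a standard rate-1/2, constraint-length-3 binary
# convolutional code (generators 111 and 101). These parameters are an
# assumption for the example, not the models described in the abstract.

def encode(bits):
    """Emit two parity bits per input bit from the current bit and the
    two previous bits (the encoder memory)."""
    s1 = s2 = 0  # s1 = previous input bit, s2 = the one before that
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)  # generator 111
        out.append(b ^ s2)       # generator 101
        s1, s2 = b, s1
    return out


def viterbi_decode(received):
    """Hard-decision Viterbi decoding: return the input sequence whose
    encoding is closest in Hamming distance to `received`, together
    with that distance."""
    inf = float("inf")
    metrics = [0, inf, inf, inf]   # encoder starts in the all-zero state
    paths = [[], [], [], []]       # surviving input sequence per state
    for t in range(len(received) // 2):
        r0, r1 = received[2 * t], received[2 * t + 1]
        new_metrics = [inf] * 4
        new_paths = [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue  # state not reachable yet
            s1, s2 = state >> 1, state & 1
            for b in (0, 1):
                e0, e1 = b ^ s1 ^ s2, b ^ s2          # expected parity pair
                cost = metrics[state] + (e0 != r0) + (e1 != r1)
                nxt = (b << 1) | s1                   # next encoder state
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(range(4), key=lambda s: metrics[s])
    return paths[best], metrics[best]


message = [1, 0, 1, 1]
noisy = encode(message)
noisy[2] ^= 1  # flip one parity bit to simulate channel noise
decoded, distance = viterbi_decode(noisy)
print(decoded, distance)
```

In this sketch the single flipped bit is corrected and the original message is recovered at Hamming distance 1. The distance returned by the decoder is the kind of quantity a coding-based classifier could threshold when asking whether a sequence "looks like" a valid codeword, though how this work scores candidate initiation sites is not specified in the abstract.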

[1]  Lila Kari,et al.  Codes, Involutions, and DNA Encodings , 2002, Formal and Natural Computing.

[2]  Mladen A. Vouk,et al.  Coding theory based maximum-likelihood classifier for translation initiation regions in Escherichia coli K-12 , 2000 .

[3]  Mladen A. Vouk,et al.  The ribosome as a table-driven convolutional decoder for the Escherichia coli K-12 translation initiation system , 2000, Proceedings of the 22nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (Cat. No.00CH37143).

[4]  Stephen M. Mount,et al.  The genome sequence of Drosophila melanogaster. , 2000, Science.

[5]  D. Bitzer,et al.  Free energy periodicity in E.coli , 1999, Proceedings of the First Joint BMES/EMBS Conference. 1999 IEEE Engineering in Medicine and Biology 21st Annual Conference and the 1999 Annual Fall Meeting of the Biomedical Engineering Society (Cat. N.

[6]  T D Schneider,et al.  Measuring molecular information. , 1999, Journal of theoretical biology.

[7]  Mladen A. Vouk,et al.  Coding model for translation in E. coli K-12 , 1999, Proceedings of the First Joint BMES/EMBS Conference. 1999 IEEE Engineering in Medicine and Biology 21st Annual Conference and the 1999 Annual Fall Meeting of the Biomedical Engineering Society (Cat. N.

[8]  David Coley,et al.  Introduction to Genetic Algorithms for Scientists and Engineers , 1999 .

[9]  Nikola Štambuk On Circular Coding Properties of Gene and Protein Sequences , 1999 .

[10]  Nikola Štambuk On the genetic origin of complementary protein coding , 1998 .

[11]  M. Borodovsky,et al.  GeneMark.hmm: new solutions for gene finding. , 1998, Nucleic acids research.

[12]  T. D. Schneider,et al.  Information content of individual genetic sequences. , 1997, Journal of theoretical biology.

[13]  Gérard Battail,et al.  Does information theory explain biological evolution , 1997 .

[14]  C J Michel,et al.  A code in the protein coding genes. , 1997, Bio Systems.

[15]  William Noble Grundy,et al.  Meta-MEME: motif-based hidden Markov models of protein families , 1997, Comput. Appl. Biosci..

[16]  A. Pavesi,et al.  On the Informational Content of Overlapping Genes in Prokaryotic and Eukaryotic Viruses , 1997, Journal of Molecular Evolution.

[17]  David Loewenstern,et al.  Significantly lower entropy estimates for natural DNA sequences , 1997, Proceedings DCC '97. Data Compression Conference.

[18]  I. Cosic The resonant recognition model of macromolecular bioactivity : theory and applications , 1997 .

[19]  Steven Salzberg,et al.  Finding Genes in DNA with a Hidden Markov Model , 1997, J. Comput. Biol..

[20]  Thomas D. Schneider,et al.  Fast Multiple Alignment of Ungapped DNA Sequences Using Information Theory and a Relaxation Method , 1996, Discret. Appl. Math..

[21]  Emmanuel Bacry,et al.  Wavelet based fractal analysis of DNA sequences , 1996 .

[22]  T G Dewey,et al.  The Shannon information entropy of protein sequences. , 1996, Biophysical journal.

[23]  Ramón Román-Roldán,et al.  Application of information theory to DNA sequence analysis: A review , 1996, Pattern Recognit..

[24]  S. Eddy Hidden Markov models. , 1996, Current opinion in structural biology.

[25]  James W. Fickett,et al.  The Gene Identification Problem: An Overview for Developers , 1995, Comput. Chem..

[26]  William R. Pearson Protein sequence comparison and protein evolution , 1995, ISMB 1995.

[27]  D. Haussler,et al.  A hidden Markov model that finds genes in E. coli DNA. , 1994, Nucleic acids research.

[28]  Ajay Dholakia Introduction to convolutional codes with applications , 1994 .

[29]  G. Christian Overton,et al.  Application of hidden Markov modeling to the characterization of transcription factor binding sites , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[30]  M. Eigen,et al.  The origin of genetic information: viruses as models. , 1993, Gene.

[31]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[32]  J. Oliver,et al.  Entropic profiles of DNA sequences through chaos-game-derived images. , 1993, Journal of theoretical biology.

[33]  Peter Salamon,et al.  A Maximum Entropy Principle for the Distribution of Local Complexity in Naturally Occurring Nucleotide Sequences , 1992, Comput. Chem..

[34]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[35]  S. Altschul Amino acid substitution matrices from an information theoretic perspective , 1991, Journal of Molecular Biology.

[36]  Peter F. Sweeney Error control coding - an introduction , 1991 .

[37]  Mladen A. Vouk,et al.  A table-driven (feedback) decoder , 1991, [1991 Proceedings] Tenth Annual International Phoenix Conference on Computers and Communications.

[38]  T. D. Schneider,et al.  Theory of molecular machines. I. Channel capacity of molecular machines. , 1991, Journal of theoretical biology.

[39]  T. D. Schneider,et al.  Theory of molecular machines. II. Energy dissipation from molecular machines. , 1991, Journal of theoretical biology.

[40]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[41]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[42]  H Almagor Nucleotide distribution and the recognition of coding regions in DNA sequences: an information theory approach. , 1985, Journal of theoretical biology.

[43]  I. Cosic,et al.  Is it Possible to Analyze DNA and Protein Sequences by the Methods of Digital Signal Processing? , 1985, IEEE Transactions on Biomedical Engineering.

[44]  Shu Lin,et al.  Error control coding : fundamentals and applications , 1983 .

[45]  R. Blahut Theory and practice of error control codes , 1983 .

[46]  T B Fowler,et al.  Computation as a thermodynamic process applied to biological systems. , 1979, International journal of bio-medical computing.

[47]  A. B. Roy,et al.  Topological information content of genetic molecules—I. , 1978 .

[48]  Jeffrey W. Roberts,et al.  Molecular biology of the gene , 1970 .

[49]  Elwyn R. Berlekamp,et al.  Algebraic coding theory , 1984, McGraw-Hill series in systems science.