Coding theory based models for protein translation initiation in prokaryotic organisms.

Our research explores the feasibility of using communication theory, specifically error control (EC) coding theory, to model the protein translation initiation mechanism quantitatively. The messenger RNA (mRNA) of Escherichia coli K-12 is modeled as a noisy (errored), encoded signal and the ribosome as a minimum Hamming distance decoder, where the 16S ribosomal RNA (rRNA) serves as a template for generating a set of valid codewords (the codebook). We tested the E. coli based coding models on 5' untranslated leader sequences of prokaryotic organisms of varying taxonomic relatedness to E. coli, including Salmonella typhimurium LT2, Bacillus subtilis, and Staphylococcus aureus Mu50. The model identified regions on the 5' untranslated leader where the minimum Hamming distance values of translated mRNA sub-sequences and non-translated genomic sequences differ the most. These regions correspond to the Shine-Dalgarno domain and the non-random domain. Applying the EC coding-based models to B. subtilis and S. aureus Mu50 yielded results similar to those for E. coli K-12. Contrary to our expectations, the behavior of S. typhimurium LT2, the organism most closely related taxonomically to E. coli, resembled that of the non-translated sequence group.
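The minimum Hamming distance decoding idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the codebook, the leader sequence, the window length, and all function names are hypothetical placeholders chosen for illustration, and the abstract does not specify how the codebook is derived from the 16S rRNA. The sketch slides a fixed-length window along a leader sequence and records, at each position, the smallest Hamming distance to any codeword.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(word: str, codebook: list[str]) -> int:
    """Distance from `word` to its nearest codeword (the decoder's metric)."""
    return min(hamming(word, c) for c in codebook)


def min_distance_profile(leader: str, codebook: list[str], k: int) -> list[int]:
    """Minimum Hamming distance for every length-k window of `leader`."""
    return [min_distance(leader[i:i + k], codebook)
            for i in range(len(leader) - k + 1)]


# Hypothetical toy codebook and leader; real codewords would be generated
# from the 16S rRNA template by a procedure not given in the abstract.
CODEBOOK = ["AGGAG", "GGAGG", "GAGGU"]
LEADER = "UUAAGGAGGUAUAUG"

profile = min_distance_profile(LEADER, CODEBOOK, k=5)
print(profile)
```

Under these assumptions, positions where the profile drops toward zero mark windows that closely match a codeword. Averaging such profiles over a group of translated leaders and over a group of non-translated sequences, then comparing the two position by position, would be one way to locate the regions where the groups differ most.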