DNA Composition, Codon Usage and Exon Prediction

Publisher Summary: This chapter reviews sequence-based measures that are indicative of protein-coding function in genomic DNA. A coding statistic can be defined as a function that, given a DNA sequence, computes a real number related to the likelihood that the sequence codes for a protein. Because they depend on more parameters, model-dependent coding statistics are likely to capture more of the specific features of coding DNA, and it is suggested that they may therefore be more powerful in discriminating coding from noncoding DNA. A DNA sequence can be partitioned into consecutive nonoverlapping codons in three different ways, depending on the nucleotide at which the grouping into codons starts. Amino acid usage and codon preference each carry considerable information about coding function, but neither measure appears to be as discriminating as codon usage. The distribution of base frequencies at the three codon positions can be taken as a statistical description of a prototypical codon. Measures based on base compositional bias between codon positions are also discussed.
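
As a rough illustration of the ideas summarized above, the following Python sketch partitions a sequence into codons in each of the three frames, scores each frame with a simple codon usage log-likelihood ratio, and tabulates base counts by codon position. It is a minimal sketch under stated assumptions, not the chapter's method: the function names, the toy codon frequencies, the uniform 1/64 background model, and the floor value for unseen codons are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def codons_in_frame(seq, frame):
    """Split seq into consecutive nonoverlapping codons starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_usage_score(seq, frame, coding_freq, background=1.0 / 64, floor=1e-6):
    """Sum of log-likelihood ratios of each codon under a coding model vs. a background model.

    coding_freq maps codons to their frequency in known coding sequences.
    The uniform background and the floor for unseen codons are assumptions.
    """
    score = 0.0
    for codon in codons_in_frame(seq, frame):
        p = coding_freq.get(codon, floor)
        score += math.log(p / background)
    return score


def position_base_counts(seq, frame):
    """Count how often each base occurs at codon positions 1, 2 and 3 in the given frame."""
    counts = [Counter(), Counter(), Counter()]
    for codon in codons_in_frame(seq, frame):
        for pos, base in enumerate(codon):
            counts[pos][base] += 1
    return counts


# Toy codon frequencies, invented for illustration only (not real codon usage data).
toy_coding_freq = {"ATG": 0.05, "GCC": 0.05, "TAA": 0.01}

seq = "ATGGCCTAA"
scores = [codon_usage_score(seq, f, toy_coding_freq) for f in range(3)]
best_frame = max(range(3), key=lambda f: scores[f])
```

A real coding statistic would estimate `coding_freq` from a training set of known genes and would typically be evaluated in a sliding window along the sequence; the per-position counts returned by `position_base_counts` are the raw material for measures based on compositional bias between codon positions.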
