Detection of genes in Escherichia coli sequences determined by genome projects and prediction of protein production levels, based on multivariate diversity in codon usage

We used principal component analysis to develop measures (called Z-parameters in this study) that reflect the diversity of codon usage in Escherichia coli genes. Protein production levels for 1500 CDSs (protein-coding sequences) identified by the E. coli genome projects in Japan and the US were estimated from a correlation equation between Z1 and cellular protein content, which was obtained by analyzing experimentally characterized genes. Through profile analysis of Z1 for the E. coli sequences obtained by the Japanese project, we predicted an additional 36 CDSs that had not been annotated in the International DNA Database. Thirty-one of the 36 CDSs could be assigned to presumptive protein genes through a BLASTX search against recent protein databases on GenomeNet in Japan. Detailed examination of the Z1-parameter profile also allowed us to assess sequencing errors that cause frameshifts.
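The idea of deriving a Z1-like score from codon usage can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how codon counts were normalized or how protein content was scaled, so the sketch assumes plain relative codon frequencies, PCA by singular value decomposition, and an ordinary least-squares line. All function names (`codon_frequencies`, `fit_pca`, `z_scores`, `fit_line`) are hypothetical.

```python
import itertools
import numpy as np

# All 64 codons in a fixed order (assumption: stop codons are kept as columns).
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Relative frequency of each of the 64 codons in an in-frame CDS."""
    counts = dict.fromkeys(CODONS, 0)
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = cds[i:i + 3].upper()
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return np.array([counts[c] / total if total else 0.0 for c in CODONS])


def fit_pca(freq_matrix, n_components=2):
    """PCA via SVD of the column-centred gene-by-codon frequency matrix.

    Returns the column means and the first n_components principal axes.
    """
    mean = freq_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(freq_matrix - mean, full_matrices=False)
    return mean, vt[:n_components]


def z_scores(freq_matrix, mean, axes):
    """Project genes onto the principal axes; column 0 plays the role of Z1."""
    return (freq_matrix - mean) @ axes.T


def fit_line(z1, protein_level):
    """Least-squares line protein_level ~ slope * z1 + intercept.

    Assumption: a straight-line 'correlation equation'; any transformation
    of the protein content is left to the caller.
    """
    slope, intercept = np.polyfit(z1, protein_level, 1)
    return slope, intercept
```

Under these assumptions, a reference set of experimentally characterized genes would be passed through `codon_frequencies` and `fit_pca`, the fitted line from `fit_line` would then be applied to the Z1 scores of uncharacterized CDSs, and a Z1 profile along a genomic sequence could be obtained by calling `z_scores` on sliding windows in each reading frame.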
