Large scale bacterial gene discovery by similarity search

DNA sequencing efforts frequently uncover genes other than the targeted ones. We have used rapid database scanning methods to search for undescribed eubacterial and archaeal protein-coding frames in regions flanking known genes. By searching all prokaryotic DNA sequences not marked as coding for proteins or stable RNAs against the protein databases, we have identified more than 450 new examples of bacterial proteins, as well as a smaller number of possible revisions to known proteins. This corresponds to a surprisingly high rate of one new protein or revision for every 24 initial DNA sequences, or 8,300 nucleotides, examined. Seven proteins are members of families that have not been described in prokaryotic sequences. We also describe 49 reinterpretations of existing sequence data of particular biological significance.
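Comparing unannotated DNA against a protein database requires translating the DNA in all six reading frames first. The following is a minimal Python sketch of that translation step only, not the authors' actual pipeline. It assumes the standard genetic code, and the function names and the `min_len` threshold are hypothetical choices made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame label (+1..+3 forward, -1..-3 reverse) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def open_segments(dna, min_len=30):
    """Yield (frame, start_residue, peptide) for stop-free stretches.

    min_len is an illustrative cutoff in amino acids, not a value
    taken from the paper.
    """
    for frame, protein in six_frame_translations(dna).items():
        pos = 0
        for segment in protein.split("*"):
            if len(segment) >= min_len:
                yield frame, pos, segment
            pos += len(segment) + 1  # skip past the stop codon
```

In a workflow of this kind, each stop-free peptide returned by `open_segments` would then be passed to a protein similarity search tool, and only hits with statistically significant scores would be kept as candidate new proteins or revisions.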
