GS-Finder: a program to find bacterial gene start sites with a self-training method.

In this paper, a self-training method is proposed to recognize translation start sites in bacterial genomes without prior knowledge of the rRNA in the genomes concerned. Several biologically meaningful features are incorporated: mononucleotide distribution patterns near the start codon, the start codon itself, the coding potential, and the distance from the leftmost start codon to the candidate start codon. The proposed method correctly predicts 92% of the translation start sites of 195 experimentally confirmed Escherichia coli CDSs, 96% of 58 reliable Bacillus subtilis CDSs and 82% of 140 reliable Synechocystis CDSs. Moreover, the self-training method might also be used to relocate the translation start sites of putative CDSs predicted by gene-finding programs. Post-processing with the method markedly improves the gene start prediction of some gene-finding programs; e.g., the accuracy of gene start prediction of Glimmer 2.02 increases from 63% to 91% for 832 reliable E. coli CDSs. GS-Finder, an open source computer program that implements the method, is freely available for academic purposes from http://tubic.tju.edu.cn/GS-Finder/.
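
One of the features named above, the distance from the leftmost start codon to a candidate start codon, can be illustrated with a minimal sketch. This is not the GS-Finder implementation. It assumes that candidates are the in-frame ATG, GTG and TTG codons found upstream of a CDS's stop codon, up to the nearest in-frame stop, and that distance is measured in nucleotides. The function names `candidate_starts` and `distance_from_leftmost` are hypothetical.

```python
# Illustrative sketch only; the start codon set and the distance unit
# (nucleotides) are assumptions, not details taken from the abstract.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, stop_pos):
    """Return 0-based positions of in-frame start codons upstream of a stop codon.

    seq      -- DNA sequence of the coding strand
    stop_pos -- 0-based index of the first base of the CDS's stop codon

    The scan walks upstream one codon at a time and ends at the first
    in-frame stop codon or at the beginning of the sequence.
    """
    seq = seq.upper()
    candidates = []
    pos = stop_pos - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # the open reading frame cannot extend past this point
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    candidates.sort()  # leftmost (most upstream) candidate first
    return candidates


def distance_from_leftmost(candidates):
    """Map each candidate position to its distance from the leftmost candidate."""
    if not candidates:
        return {}
    leftmost = candidates[0]
    return {pos: pos - leftmost for pos in candidates}


if __name__ == "__main__":
    # Toy sequence: TAA ATG AAA GTG CCC TGA (spaces added here for readability)
    toy = "TAAATGAAAGTGCCCTGA"
    starts = candidate_starts(toy, stop_pos=15)
    print(starts)
    print(distance_from_leftmost(starts))
```

In a full predictor, this distance would be only one input, combined with the other features listed in the abstract to score each candidate.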
