An unsupervised classification scheme for improving predictions of prokaryotic TIS

Background

Although state-of-the-art gene finders identify coding regions in prokaryotic genomes without much difficulty, exact prediction of the corresponding translation initiation sites (TIS) remains a challenging problem. Recently, a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, these approaches face inherent difficulties because TIS characteristics vary considerably across species. Prior assumptions about the properties of prokaryotic gene starts may therefore lead to suboptimal predictions for newly sequenced genomes whose TIS signals differ from those of well-investigated genomes.

Results

We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. To perform the reannotation, the algorithm requires only an initial gene prediction and the genomic sequence of the organism. In contrast to other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. In contrast to some previously proposed methods, our algorithm also performs well on the high-G+C genomes of P. aeruginosa, B. pseudomallei and R. solanacearum.

Conclusion

On reliable test data we showed that our method provides good results in post-processing the predictions of the widely used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool »TICO« (TIs COrrector), which is publicly available from our web site.
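The Results paragraph refers to scoring candidate TIS with positionally smoothed probability matrices. The following is a minimal sketch of that general idea only, not the TICO implementation: the Gaussian smoothing kernel, the pseudocount, the uniform background, the toy windows and all function names are assumptions made for illustration, and the unsupervised clustering step that would assign windows to classes is omitted.

```python
import math

ALPHABET = "ACGT"


def count_matrix(windows, pseudocount=1.0):
    """Per-position nucleotide counts for equal-length sequence windows."""
    length = len(windows[0])
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(length)]
    for window in windows:
        for i, base in enumerate(window):
            counts[i][base] += 1.0
    return counts


def smooth_positions(counts, sigma=1.0):
    """Average counts over neighbouring positions with Gaussian weights.

    The Gaussian kernel is an assumed choice, used here only to illustrate
    positional smoothing.
    """
    length = len(counts)
    smoothed = []
    for i in range(length):
        weights = [math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
                   for j in range(length)]
        total = sum(weights)
        smoothed.append({
            b: sum(w * counts[j][b] for j, w in enumerate(weights)) / total
            for b in ALPHABET
        })
    return smoothed


def to_probabilities(counts):
    """Normalise each column so that its entries sum to one."""
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]


def log_odds_score(window, pwm, background=0.25):
    """Log-odds of a window under the matrix versus a uniform background."""
    return sum(math.log(pwm[i][b] / background) for i, b in enumerate(window))


# Toy windows standing in for regions upstream of candidate start codons.
windows = ["AGGAGG", "AGGAGA", "TGGAGG"]
pwm = to_probabilities(smooth_positions(count_matrix(windows)))
motif_score = log_odds_score("AGGAGG", pwm)
other_score = log_odds_score("CCCCCC", pwm)
```

In this toy setting a window resembling the training windows receives a higher score than an unrelated one. In an unsupervised scheme, such scores would be recomputed as the assignment of candidates to classes is refined.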
