A novel bacterial gene-finding system with improved accuracy in locating start codons.

Although a number of bacterial gene-finding programs have been developed, there is still room for improvement, especially in correctly detecting translation start sites. We developed a novel bacterial gene-finding program named GeneHacker Plus. Like many others, it is based on a hidden Markov model (HMM) with duration. However, it is a 'local' model in the sense that the model starts from the translation control region and ends at the stop codon of a coding region. Multiple coding regions are identified as partial paths, like local alignments in the Smith-Waterman algorithm, regardless of how they overlap. Moreover, our semiautomatic procedure for constructing the model of the translation control region allows the inclusion of an additional conserved element alongside the ribosome-binding site. We confirmed that GeneHacker Plus is one of the most accurate programs in terms of both finding potential coding regions and precisely locating translation start sites. GeneHacker Plus also offers an option in which the results of database homology searches are embedded directly in the HMM. Although this option does not raise the overall prediction accuracy, the labeled similarity information can be of practical use. GeneHacker Plus is freely accessible at http://elmo.ims.u-tokyo.ac.jp/GH/.
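
The idea of reporting coding regions as independent partial paths can be illustrated with a minimal Python sketch. It is not the GeneHacker Plus model: it has no HMM states, no duration modeling and no translation control region, and it scans the forward strand only. The function name `local_orfs` and the `codon_score` callback (assumed to return a log-odds-like score per codon) are hypothetical names introduced here for illustration. For each stop codon, the sketch picks the in-frame start codon that maximizes the summed score, in the spirit of a Smith-Waterman local path, and reports every such segment even when segments in different frames overlap.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def local_orfs(seq, codon_score, threshold=0.0):
    """Return sorted (start, end, score) tuples for forward-strand ORFs.

    Each reading frame is scanned independently. For every stop codon,
    the in-frame start codon giving the highest summed codon score is
    chosen, and the segment is reported if its score exceeds `threshold`.
    Segments found in different frames may overlap; none is suppressed.
    `end` is exclusive and includes the stop codon.
    """
    hits = []
    for frame in range(3):
        prefix = 0.0       # running score since the last stop codon
        best_start = None  # (prefix value before the start codon, index)
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if best_start is not None:
                    score = prefix - best_start[0]
                    if score > threshold:
                        hits.append((best_start[1], i + 3, score))
                # A stop codon closes the local path; start afresh.
                prefix, best_start = 0.0, None
                continue
            # The lowest prefix at a start codon yields the highest
            # segment score, so remember that start.
            if codon in START_CODONS and (
                best_start is None or prefix < best_start[0]
            ):
                best_start = (prefix, i)
            prefix += codon_score(codon)
    return sorted(hits)


if __name__ == "__main__":
    # Toy scoring (an assumption): every non-stop codon scores 1.0.
    for hit in local_orfs("ATGCATGAAACCTAAGTGA", lambda codon: 1.0):
        print(hit)
```

With the toy sequence and uniform scoring above, the sketch reports two segments, one in frame 0 and one in frame 1, that overlap by eleven bases. A real gene finder would replace the uniform score with trained emission probabilities and would also scan the reverse strand.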
