MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes

The annotation of newly sequenced bacterial genomes begins with running several automatic analysis methods, with major emphasis on the identification of protein-coding genes. DNA sequences are heterogeneous in local nucleotide composition and this leads sometimes to sequences being annotated as authentic genes when they are not protein-coding genes or are true but uncharacterized protein-coding genes. This first annotation step is generally followed by an expert manual annotation of the predicted genes. The genomic data (sequence and annotations) organized in an appropriate databank file format is subsequently submitted to an entry point of the International Nucleotide Sequence Database. These procedures are inevitably subject to mistakes, and this can lead to unintentional syntactic annotation errors being stored in public databanks. Here, we present a new web program, MICheck (MIcrobial genome Checker), that enables rapid verification of sets of annotated genes and frameshifts in previously published bacterial genomes. The web interface allows one easily to investigate the MICheck results, i.e. inaccurate or missed gene annotations: a graphical representation is drawn, in which the genomic context of a unique coding DNA sequence annotation or a predicted frameshift is given, using information on the coding potential (curves) and annotation of the neighbouring genes. We illustrate some capabilities of the MICheck site through the analysis of 20 bacterial genomes, 9 of which were selected for their ‘Reviewed’ status in the National Center for Biotechnology Information (NCBI) Reference Sequence Project (RefSeq). In the context of the numerous re-annotation projects for microbial genomes, this tool can be seen as a preliminary step before the functional re-annotation step to check quickly for missing or wrongly annotated genes. The MICheck website is accessible at the following address: .

[1]  Miguel A. Andrade-Navarro,et al.  Evaluation of annotation strategies using an entire genome sequence , 2003, Bioinform..

[2]  M. Borodovsky,et al.  How to interpret an anonymous bacterial genome: machine learning approach to gene identification. , 1998, Genome research.

[3]  David W Ussery,et al.  Genome Update: annotation quality in sequenced microbial genomes. , 2004, Microbiology.

[4]  A Danchin,et al.  Detecting and analyzing DNA sequencing errors: toward a higher quality of the Bacillus subtilis genome sequence. , 1999, Genome research.

[5]  Kim Rutherford,et al.  Artemis: sequence visualization and annotation , 2000, Bioinform..

[6]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database - An integrated resource of GO annotations to the UniProt Knowledgebase , 2003, Silico Biol..

[7]  Jean-Jacques Codani,et al.  LASSAP, a LArge Scale Sequence compArison Package , 1997, Comput. Appl. Biosci..

[8]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[9]  Alex Bateman,et al.  The InterPro Database, 2003 brings increased coverage and new features , 2003, Nucleic Acids Res..

[10]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[11]  Peter D Karp,et al.  The past, present and future of genome-wide re-annotation , 2002, Genome Biology.

[12]  A. Valencia,et al.  Intrinsic errors in genome annotation. , 2001, Trends in genetics : TIG.

[13]  M. Borodovsky,et al.  GeneMark.hmm: new solutions for gene finding. , 1998, Nucleic acids research.

[14]  Grégory Nuel,et al.  AMIGene: Annotation of MIcrobial Genes , 2003, Nucleic Acids Res..

[15]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[16]  Antoine Danchin,et al.  Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes , 2002, BMC Bioinformatics.

[17]  Anders Krogh,et al.  EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance , 2003, BMC Bioinformatics.

[18]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[19]  Rajeev K. Azad,et al.  Effects of choice of DNA sequence model structure on gene identification accuracy , 2004, Bioinform..