Annotation of bacterial and archaeal genomes: improving accuracy and consistency.

ly, annotating a genome amounts to attaching information to support use of the genome. This includes an almost endless variety of types of analysis and attachment of interpretations. In our experience, it has proven useful to prioritize the analyses, and the following items provide at least a reasonable working notion of what is meant:

1. Genes are identified. This effort includes at least protein-encoding genes and some of the RNA-encoding genes (often just tRNAs and rRNAs).
2. The functions of genes are predicted.
3. Metabolic reconstructions are developed and tied to the specific genes.
4. Prophages, insertion sequences, and transposons are labeled.
5. Frameshifts and pseudogenes are predicted.
6. Regulatory sites and operons are identified as a step toward developing an inventory of regulons.

In practice, usually only the locations of genes and their predicted functions are provided by the initial annotation effort. Accordingly the first part of this review deals with the status of gene identification in prokaryotes, and the second part deals with the task of predicting the function to be associated with protein-coding genes.

2. Gene Prediction in Bacteria and Archaea

Once the sequence of a prokaryotic genome has been determined, the next step is the definition of the functional

Figure 1. Growth of available genomes and SwissProt annotations. While the primary sequence repository (GenBank¹) doubles in size every 18 months, high-quality annotations (we take SwissProt² as an example) cannot keep up with this growth. The graph compares the growth on a logarithmic scale.

Figure 2. How annotations are done. This diagram is intended to convey the interactions between the different types of activities that make up the annotation process (blue). A key point is that maintenance and improvement of annotations originate in expert analysis and are reflected through protein family curation, since the “curate genomes” activity is seldom done.
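The 18-month doubling time quoted in the Figure 1 caption can be turned into more familiar growth figures with a little arithmetic. The sketch below assumes idealized exponential growth, which is a simplification and not a claim about the actual GenBank data; the function names are hypothetical and chosen only for illustration.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months,
    assuming it doubles every `doubling_months` months."""
    return 2.0 ** (12.0 / doubling_months)


def fold_increase(years: float, doubling_months: float) -> float:
    """Total fold increase after `years`, under the same
    idealized exponential-growth assumption."""
    return 2.0 ** (years * 12.0 / doubling_months)


if __name__ == "__main__":
    # 18 months is the doubling time stated in the Figure 1 caption.
    print(f"annual growth factor: {annual_growth_factor(18.0):.3f}")
    print(f"fold increase over 6 years: {fold_increase(6.0, 18.0):.0f}")
```

Under this assumption, a doubling every 18 months corresponds to growth of roughly 59% per year, or a 16-fold increase over six years. On a logarithmic scale, such growth appears as a straight line.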
Annotation of Bacterial and Archaeal Genomes Chemical Reviews, 2007, Vol. 107, No. 8 3433