The automatic annotation of bacterial genomes

With the development of ultra-high-throughput technologies, the cost of sequencing bacterial genomes has been vastly reduced. As more genomes are sequenced, less time can be spent manually annotating each one, resulting in an increased reliance on automatic annotation pipelines. However, automatic pipelines can produce inaccurate genome annotations, and their results often require manual curation. Here, we discuss the automatic and manual annotation of bacterial genomes, identify common problems introduced by the current genome annotation process, and suggest potential solutions.
