Correcting Inconsistencies and Errors in Bacterial Genome Metadata Using an Automated Curation Tool in Excel (AutoCurE)

Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease and speed, at a substantially reduced cost per nucleotide and hence per genome. More than 3,000 bacterial genomes have been sequenced and are available at finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local curation of the supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE curates local genomic databases by comparing the downloaded supporting data to the genome reports and flagging inconsistencies or errors, verifying genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and the genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local curation of large-scale genome data prior to downstream analyses.
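
The comparison described above can be illustrated with a minimal sketch. AutoCurE itself is an Excel tool, and its actual column names, flag wording, and matching rules are not given here, so everything in the block below is an assumption made for illustration: the field names (`genome_name`, `refseq_accession`, `file_description`, `kingdom`), the use of the BioProject/UID as the join key, and the helper names `flag_record` and `curate`. It covers only a few of the nine fields mentioned.

```python
# Illustrative sketch only: field names, flag texts, and the join key are
# hypothetical and are not taken from AutoCurE.
COMPARED_FIELDS = ("genome_name", "refseq_accession", "file_description")


def flag_record(downloaded, report):
    """Return a list of flag strings for one downloaded genome record.

    `downloaded` is the supporting data for a local genome; `report` is the
    matching genome report entry, or None if no entry was found.
    """
    if report is None:
        return ["no matching genome report entry"]
    flags = []
    for field in COMPARED_FIELDS:
        local = (downloaded.get(field) or "").strip()
        expected = (report.get(field) or "").strip()
        if not local:
            flags.append(f"{field}: missing in download")
        elif local != expected:
            flags.append(f"{field}: mismatch")
    # Flag archaeal entries so they can be reviewed in a bacterial database.
    if (report.get("kingdom") or "").strip().lower() == "archaea":
        flags.append("kingdom: archaea present")
    return flags


def curate(downloads, reports):
    """Map each downloaded record's BioProject/UID to its list of flags."""
    by_uid = {r["bioproject_uid"]: r for r in reports}
    result = {}
    for record in downloads:
        uid = record.get("bioproject_uid", "")
        result[uid] = flag_record(record, by_uid.get(uid))
    return result
```

Calling `curate(downloads, reports)` with two lists of dictionaries returns one list of flags per downloaded record; an empty list means that no inconsistency was detected for the fields checked in this sketch.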