Toward a standard in structural genome annotation for prokaryotes

Background: In an effort to identify the best practice for finding genes in prokaryotic genomes and to propose it as a standard for automated annotation pipelines, 1,004,576 peptides were collected from various publicly available resources and used as a basis for evaluating gene-calling methods. The peptides came from 45 bacterial replicons with average GC content ranging from 31% to 74%, biased toward higher-GC genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene-calling methods, as evidenced by peptides mapped outside the boundaries of called genes.

Results: We found that the consensus set of identical genes predicted by all three methods constitutes only about 70% of the genes predicted by each individual method (with start and stop required to coincide). The peptide data were useful for evaluating some of the differences between gene callers, but were not reliable enough to make the results conclusive, owing to limitations inherent in any proteogenomic study.

Conclusions: A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.
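The two measurements described above (the share of each caller's genes that fall in the exact-match consensus, and peptides mapped outside called gene boundaries) can be illustrated with a minimal sketch. This is not the study's software: the `Gene` record, the function names, and the coordinate convention (`start` is the leftmost and `stop` the rightmost position on the replicon, regardless of strand) are assumptions made for illustration, and reading frame is ignored for brevity.

```python
from typing import Dict, NamedTuple, Set


class Gene(NamedTuple):
    """Hypothetical record for a called gene or a mapped peptide."""
    replicon: str
    strand: str   # "+" or "-"
    start: int    # leftmost coordinate (assumed convention)
    stop: int     # rightmost coordinate (assumed convention)


def consensus_fraction(calls: Dict[str, Set[Gene]]) -> Dict[str, float]:
    """For each caller, the fraction of its genes predicted identically
    (same replicon, strand, start, and stop) by every caller."""
    shared = set.intersection(*calls.values())
    return {
        name: (len(shared) / len(genes) if genes else 0.0)
        for name, genes in calls.items()
    }


def peptide_outside_genes(peptide: Gene, genes: Set[Gene]) -> bool:
    """True if no called gene on the same replicon and strand fully
    contains the mapped peptide (frame is not checked in this sketch)."""
    return not any(
        g.replicon == peptide.replicon
        and g.strand == peptide.strand
        and g.start <= peptide.start
        and peptide.stop <= g.stop
        for g in genes
    )
```

Requiring the full coordinate tuple to match mirrors the condition that start and stop must coincide; relaxing the comparison to stop coordinates only would count genes that differ solely in their predicted start as agreeing.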
