AssessORF: combining evolutionary conservation and proteomics to assess prokaryotic gene predictions

MOTIVATION: A core task of genomics is to identify the boundaries of protein coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding and, therefore, most developers select their own benchmarking strategy.

RESULTS: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer, and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88-95% in agreement with the available evidence, with Glimmer performing the worst but no clear winner. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites.

AVAILABILITY: AssessORF is available as an R package via the Bioconductor package repository.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
