PHANOTATE: a novel approach to gene identification in phage genomes

Abstract

Motivation: Currently there are no tools specifically designed for annotating genes in phages. Several tools have been adapted to run on phage genomes, but their underlying design prevents them from capturing the full complexity of these genomes. Phages have evolved extremely compact genomes, in which adjacent genes overlap and some genes lie completely inside other, longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact arrangement of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths in which open reading frames are favorable, and overlaps and gaps are less favorable but still possible. We represent this network of connections as a weighted graph and use dynamic programming to find the optimal path.

Results: We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank with PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the Sequence Read Archive, and found that they are present at levels that suggest they are functional protein-coding genes.

Availability and implementation: https://github.com/deprekate/PHANOTATE

Supplementary information: Supplementary data are available at Bioinformatics online.
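The weighted-graph idea in the abstract can be illustrated with a minimal sketch. It is not PHANOTATE's implementation: the ORF scores, the penalty values, the function names `connection_cost` and `best_path`, and the restriction to one strand with no nested genes are all assumptions made for illustration. Each open reading frame lowers the path cost by its score, each gap or overlap between consecutive ORFs raises it, and dynamic programming over ORFs sorted by start position finds the cheapest path from the start of the genome to its end.

```python
from typing import List, Tuple

# (start, stop, score); the score is a hypothetical "how coding-like" value > 0
ORF = Tuple[int, int, float]


def connection_cost(prev_stop: int, next_start: int,
                    gap_penalty: float = 0.01,
                    overlap_penalty: float = 0.05) -> float:
    """Cost of joining two features; both penalty rates are assumed values."""
    distance = next_start - prev_stop
    if distance >= 0:
        return gap_penalty * distance        # non-coding gap
    return overlap_penalty * (-distance)     # overlapping genes


def best_path(orfs: List[ORF], genome_length: int) -> List[ORF]:
    """Return the lowest-cost chain of ORFs across the genome."""
    nodes = sorted(orfs)
    if not nodes:
        return []
    n = len(nodes)
    cost = [0.0] * n   # cost[i]: cheapest path from genome start ending at ORF i
    back = [-1] * n    # back[i]: previous ORF on that path, -1 for genome start
    for i, (start, stop, score) in enumerate(nodes):
        # Option 1: ORF i is the first gene on the path.
        cost[i] = connection_cost(0, start) - score
        # Option 2: ORF i follows an earlier ORF j (nested genes are skipped here).
        for j in range(i):
            p_start, p_stop, _ = nodes[j]
            if p_start < start and p_stop < stop:
                candidate = cost[j] + connection_cost(p_stop, start) - score
                if candidate < cost[i]:
                    cost[i], back[i] = candidate, j
    # Close the path at the genome end and pick the cheapest final ORF.
    end = min(range(n),
              key=lambda i: cost[i] + connection_cost(nodes[i][1], genome_length))
    path = []
    while end != -1:
        path.append(nodes[end])
        end = back[end]
    return path[::-1]


if __name__ == "__main__":
    toy_orfs = [(0, 300, 5.0), (290, 600, 4.0), (100, 200, 0.5), (650, 1000, 3.0)]
    for orf in best_path(toy_orfs, genome_length=1000):
        print(orf)
```

In this toy input the short, weakly scoring ORF at 100 to 200 is left out, while the 10-base overlap between the first two ORFs is tolerated because the penalty is smaller than the score gained. Because ORFs sorted by position form a directed acyclic graph, a single left-to-right pass is enough in this sketch; a general shortest-path algorithm that handles negative edge weights would be an alternative.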
