Motivation
Currently there are no tools specifically designed for annotating genes in phages. Several tools have been adapted to run on phage genomes, but their underlying design prevents them from capturing the full complexity of those genomes. Phages have evolved extremely compact genomes, in which adjacent genes overlap and some genes lie entirely within other, longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present THEA (The Algorithm), a novel method for gene calling specifically designed for phage genomes. While the compact arrangement of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths in which open reading frames are favorable, and overlaps and gaps are less favorable but still possible. We represent this network of connections as a weighted graph and use graph theory to find the optimal path.

Results
We compare THEA to other gene callers by annotating a set of 2,133 complete phage genomes from GenBank with THEA and the three most popular gene callers. The four programs agree on 82% of the total predicted genes, with THEA predicting significantly more genes than the other three. We searched for these extra genes in both GenBank's non-redundant protein database and the Sequence Read Archive, and found them at levels that suggest they are functional protein-coding genes.

Availability and Implementation
The source code and all files can be found at: https://github.com/deprekate/THEA

Contact
Katelyn McNair: deprekate@gmail.com
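
The weighted-graph formulation described under Motivation can be illustrated with a small sketch. This is not THEA's implementation: the node names, the edge weights, and the choice of the Bellman-Ford algorithm are all assumptions made for illustration (the abstract does not say which shortest-path method is used). Bellman-Ford is chosen here only because it tolerates the negative weights that a "favorable" open reading frame edge would carry. Each hypothetical ORF contributes a negative-weight edge from its start to its stop, and each gap or overlap between consecutive ORFs contributes a small positive penalty.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (from_node, to_node, weight)


def bellman_ford(nodes: List[str], edges: List[Edge], source: str):
    """Single-source shortest paths that allow negative edge weights."""
    dist: Dict[str, float] = {n: float("inf") for n in nodes}
    pred: Dict[str, Optional[str]] = {n: None for n in nodes}
    dist[source] = 0.0
    # Relax every edge up to |V| - 1 times, stopping early once stable.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:
            break
    # One further pass: any improvement now implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, pred


def best_path(nodes: List[str], edges: List[Edge], source: str, target: str):
    """Return the minimum-weight path from source to target and its cost."""
    dist, pred = bellman_ford(nodes, edges, source)
    if dist[target] == float("inf"):
        raise ValueError("target is not reachable from source")
    path: List[str] = []
    node: Optional[str] = target
    while node is not None:
        path.append(node)
        node = pred[node]
    return path[::-1], dist[target]


# Hypothetical toy genome: ORF A is followed by either ORF B (which
# overlaps A slightly) or ORF C (separated from A by a short gap).
# All weights are made up for illustration only.
nodes = ["source", "A_start", "A_stop", "B_start", "B_stop",
         "C_start", "C_stop", "sink"]
edges: List[Edge] = [
    ("source", "A_start", 1.0),    # leading gap: small penalty
    ("A_start", "A_stop", -30.0),  # ORF A: favorable (negative weight)
    ("A_stop", "B_start", 5.0),    # overlap between A and B: penalty
    ("B_start", "B_stop", -20.0),  # ORF B: favorable
    ("A_stop", "C_start", 2.0),    # gap between A and C: penalty
    ("C_start", "C_stop", -10.0),  # ORF C: weakly favorable
    ("B_stop", "sink", 1.0),       # trailing gap
    ("C_stop", "sink", 3.0),       # trailing gap
]

path, cost = best_path(nodes, edges, "source", "sink")
print(path, cost)
```

In this toy graph the route through ORF B is chosen despite its overlap penalty, because the stronger ORF outweighs the cost of overlapping ORF A. That is the behavior the abstract describes: overlaps are disfavored but still permitted when the genes they connect are well supported.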