Annotating Metagenomically Assembled Bacteriophage from a Unique Ecological System using Protein Structure Prediction and Structure Homology Search

Emergent long-read sequencing technologies such as the Oxford Nanopore platform are invaluable for constructing high-quality, complete genomes from a metagenome, and are needed to investigate unique ecosystems at the genetic level. However, generating informative functional annotations from sequences that are highly divergent from those in existing nucleotide and protein sequence databases is a major challenge. In this study, we present wet-lab and dry-lab techniques that allowed us to generate 5432 high-quality, sub-genomic-sized circular metagenomic contigs from 10 samples of microbial communities. This unique ecological system exists in an environment enriched with naphthenic acid (NA), a major toxic byproduct of crude oil refining and the major carbon source for this community. Annotation by sequence homology alone was insufficient to characterize the community, so as a proof of principle we took a subset of 227 putative bacteriophage and greatly improved our existing annotations by predicting the structures of hypothetical proteins with ColabFold and searching for structural homologs with Foldseek. The proportion of proteins in each bacteriophage that were highly similar to known proteins increased from approximately 10% to about 50%, while the proportion of annotations with KEGG or GO terms increased from essentially 0% to 15%. Protein structure prediction and structural homology searches can therefore produce more informative annotations for microbes in unique ecological systems. The characterization of novel microbial ecosystems involved in the bioremediation of crude-oil-process-affected wastewater can be greatly improved, and this method opens the door to the discovery of novel NA-degrading pathways.

IMPORTANCE Functional annotation of metagenome-assembled sequences from novel or unique microbial communities is challenging when the sequences are highly dissimilar to organisms or proteins in existing databases. This is a major obstacle for researchers attempting to characterize the functional capabilities of unique ecosystems. In this study, we demonstrate that including protein structure prediction and structural homology search vastly improves the annotation of predicted genes in novel putative bacteriophage from a bacterial community that degrades naphthenic acids, the major toxic component of oil refinery wastewater. This method can be extended to similar genomics studies of unique, uncharacterized ecosystems to improve their annotations.
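The per-bacteriophage proportion reported above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that structural hits are available as tab-separated text in the BLAST-style 12-column layout (query in column 1, E-value in column 11) and that a protein counts as "highly similar to a known protein" when it has at least one hit at or below an E-value cutoff. The function name `annotated_fraction`, the cutoff `max_evalue`, and the toy identifiers are hypothetical and chosen only for illustration.

```python
def annotated_fraction(proteins_by_phage, hits_tsv, max_evalue=1e-3):
    """Return {phage: fraction of its proteins with >= 1 hit at or below max_evalue}.

    proteins_by_phage: dict mapping a phage ID to the list of its protein IDs.
    hits_tsv: tab-separated hit table as one string; column 1 is the query
              protein ID and column 11 is the E-value (assumed layout).
    max_evalue: hypothetical significance cutoff, not taken from the study.
    """
    # Collect every query protein that has at least one significant hit.
    annotated = set()
    for line in hits_tsv.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            annotated.add(query)

    # Fraction of annotated proteins per phage (0.0 for a phage with no proteins).
    return {
        phage: (sum(p in annotated for p in prots) / len(prots)) if prots else 0.0
        for phage, prots in proteins_by_phage.items()
    }


# Toy input: two hypothetical phage contigs and four hypothetical hits.
proteins = {
    "phage_A": ["A_1", "A_2", "A_3", "A_4"],
    "phage_B": ["B_1", "B_2"],
}
hits = "\n".join([
    "A_1\ttarget1\t0.45\t120\t60\t2\t1\t120\t5\t124\t1e-10\t210",
    "A_2\ttarget2\t0.20\t80\t60\t4\t1\t80\t10\t90\t0.5\t30",
    "A_3\ttarget3\t0.35\t150\t90\t3\t1\t150\t1\t150\t1e-5\t140",
    "B_1\ttarget4\t0.15\t60\t50\t1\t1\t60\t1\t60\t1.0\t20",
])

fractions = annotated_fraction(proteins, hits)
print(fractions)
```

Running the same function on a sequence-only hit table and on a table that also includes structure-based hits would give the before and after proportions for each phage. Any real cutoff should be chosen and validated against the data at hand.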
