zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters

Many universally and conditionally important genes are genomically aggregated within clusters. Here, we introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile genetic elements (MGEs), such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck in reliably performing comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes. First, fai identifies orthologous or homologous instances of a query gene cluster of interest in a database of target genomes. Subsequently, zol enables reliable, context-specific inference of protein-encoding ortholog groups for individual genes across gene cluster instances. In addition, zol performs functional annotation and computes a variety of statistics for each inferred ortholog group. These programs are showcased through application to: (i) longitudinal tracking of a virus in metagenomes, (ii) gaining novel population-genetic insights into two common BGCs in a fungal species, and (iii) uncovering large-scale evolutionary trends of a virulence-associated gene cluster across thousands of genomes from a diverse bacterial genus.
