BlastFrost: fast querying of 100,000s of bacterial genomes in Bifrost graphs

BlastFrost is a highly efficient method for querying 100,000s of genome assemblies. It builds on Bifrost, a recently developed dynamic data structure for compacted and colored de Bruijn graphs built from bacterial genomes. BlastFrost queries a Bifrost data structure for sequences of interest and extracts local subgraphs, thereby enabling the efficient identification of the presence or absence of individual genes or single nucleotide sequence variants. Here we describe the algorithms and implementation of BlastFrost. We also present two exemplar practical applications. In the first, we determined the presence of each individual gene of the Salmonella pathogenicity island SPI-2 in a collection of 926 representative genomes within minutes. In the second, we determined the presence of known single nucleotide polymorphisms associated with fluoroquinolone resistance in the genes gyrA, gyrB and parE among 190,209 Salmonella genomes. BlastFrost is available for download at https://github.com/nluhmann/BlastFrost.
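The idea behind a presence/absence query on a colored k-mer index can be illustrated with a toy example. The sketch below is not BlastFrost's algorithm or the Bifrost API: it replaces the compacted de Bruijn graph with a plain dictionary from k-mers to genome names ("colors"), ignores reverse complements and mismatches, and does not extract subgraphs. The function names, the `min_fraction` threshold and the example sequences are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it.

    A stand-in for a colored de Bruijn graph; assumption: no compaction,
    forward strand only.
    """
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def query(index, genome_names, query_seq, k, min_fraction=1.0):
    """Report, per genome, the fraction of query k-mers found and a presence call.

    min_fraction is a hypothetical threshold: the query is called present
    when at least this fraction of its k-mers occurs in the genome.
    """
    query_kmers = list(kmers(query_seq, k))
    if not query_kmers:
        raise ValueError("query must be at least k characters long")
    hits = {name: 0 for name in genome_names}
    for kmer in query_kmers:
        # .get avoids inserting missing k-mers into the defaultdict
        for name in index.get(kmer, ()):
            hits[name] += 1
    total = len(query_kmers)
    return {
        name: (count / total, count / total >= min_fraction)
        for name, count in hits.items()
    }


# Toy data: genome B differs from genome A by a single substitution.
genomes = {"A": "ACGTACGTGGA", "B": "ACGTTCGTGGA"}
k = 4
index = build_index(genomes, k)
result = query(index, genomes.keys(), "TACGTG", k)
for name, (fraction, present) in sorted(result.items()):
    print(name, round(fraction, 3), present)
```

In this toy setting, a single substitution in genome B removes some of the query's k-mers, so the exact-match criterion calls the query absent there. This loosely mirrors how a k-mer index can separate genomes that carry a sequence variant from those that do not.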
