Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

BioHansel performs high-resolution genotyping of bacterial isolates by identifying phylogenetically informative single nucleotide polymorphisms (SNPs), also known as canonical SNPs, in whole genome sequencing (WGS) data. The application uses a fast k-mer matching algorithm to map pathogen WGS data to canonical SNPs contained in hierarchically structured schemas and assigns genotypes based on the detected SNP profile. Using modest computing resources, BioHansel efficiently types isolates from raw sequence reads or assembled contigs in a matter of seconds, making it attractive for use by public health, food safety, environmental, and agricultural authorities that wish to apply WGS methodologies for their surveillance, diagnostics, and research programs. BioHansel currently provides canonical SNP genotyping schemas for four prevalent Salmonella serovars—Typhi, Typhimurium, Enteritidis and Heidelberg—as well as a schema for Mycobacterium tuberculosis. Users can also supply their own schemas for genotyping other organisms. BioHansel’s quality assurance system assesses the validity of the genotyping results and can identify low quality data, contaminated datasets, and misidentified organisms. BioHansel is targeted to support surveillance, source attribution, risk assessment, diagnostics, and rapid screening for public health purposes, such as product recalls. BioHansel is an open source application with packages available for PyPI, Conda, and the Galaxy workflow manager. In summary, BioHansel performs efficient, rapid, accurate, and high-resolution classification of bacterial genomes from sequence reads or assembled contigs on standard computing hardware. BioHansel is suitable for use as a general research tool as well as in fully operationalized WGS workflows at the front lines of infectious disease surveillance, diagnostics, and outbreak investigation and response. 
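The k-mer matching and hierarchical genotype assignment described above can be illustrated with a minimal sketch. It is not BioHansel's implementation: the schema, its k-mers, the genotype labels, and the function names are all hypothetical, and a plain sliding-window dictionary lookup stands in for whatever matching algorithm the tool actually uses. The sketch also ignores reverse-complement matches, negative k-mers, and quality checks.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy schema (not a real BioHansel schema):
# each k-mer marks a canonical SNP and maps to a hierarchical genotype label.
TOY_SCHEMA: Dict[str, str] = {
    "ACGTACG": "1",
    "TTGACCA": "1.1",
    "GGATCCA": "1.1.2",
    "CCATGGT": "2",
}
K = 7  # k-mer length used by the toy schema


def find_genotypes(sequence: str, schema: Dict[str, str], k: int) -> Set[str]:
    """Slide a window of length k over the sequence and collect matched labels."""
    found: Set[str] = set()
    for i in range(len(sequence) - k + 1):
        label = schema.get(sequence[i:i + k])
        if label is not None:
            found.add(label)
    return found


def ancestors(label: str) -> List[str]:
    """Return the parent labels of a genotype, e.g. '1.1.2' -> ['1', '1.1']."""
    parts = label.split(".")
    return [".".join(parts[:n]) for n in range(1, len(parts))]


def assign_genotype(found: Set[str]) -> Optional[str]:
    """Pick the deepest matched genotype whose ancestors were all matched too."""
    consistent = [g for g in sorted(found) if all(a in found for a in ancestors(g))]
    if not consistent:
        return None
    return max(consistent, key=lambda g: g.count("."))


if __name__ == "__main__":
    contig = "NNACGTACGNNTTGACCANNGGATCCANN"
    print(assign_genotype(find_genotypes(contig, TOY_SCHEMA, K)))
```

Because each label encodes its lineage, requiring every ancestor to be detected gives a simple consistency check: a deep genotype reported without its parent lineages returns no call in this sketch.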
Impact statement

Public health, food safety, environmental, and agricultural authorities are currently engaged in a global effort to incorporate whole genome sequencing (WGS) technologies into their infectious disease research, surveillance, and outbreak investigation programs. The widespread adoption of WGS, however, has been impeded by two major obstacles: the need for high performance computing to generate results and the expert knowledge required to interpret and communicate those results. BioHansel addresses these limitations by rapidly genotyping pathogens from whole genome sequence data in an accurate, simple, familiar, and easily sharable manner using standard computing resources. BioHansel provides a compact and readily interpretable genotype based on canonical SNP genotyping schemas. BioHansel’s genotyping nomenclature encodes the pathogen’s position in its population structure, which simplifies comparison with actively circulating and historical strains. The genotyping information provided by BioHansel can be used to identify points of intervention to prevent the spread of pathogenic bacteria, screen for the presence of priority pathogens, and perform source attribution and risk assessment. Thus, BioHansel serves as a readily accessible and powerful WGS method, implementable on a laptop, for genotyping pathogens to detect, monitor, and control the emergence and spread of infectious disease through surveillance, screening, diagnostics, and outbreak investigation and response activities.

Data summary

BioHansel is a Python 3 application available as PyPI, Conda, and Galaxy Tool Shed packages. It is an open source application distributed under the Apache License, Version 2.0. Source code is available at https://github.com/phac-nml/biohansel. The BioHansel user guide is available at https://bio-hansel.readthedocs.io/en/readthedocs/. Supplementary Materials are available at https://github.com/phac-nml/biohansel-manuscript-supplementary-data.
The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
