Rapid geographical source attribution of Salmonella enterica serovar Enteritidis genomes using hierarchical machine learning

Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans through the consumption of contaminated food. In the UK and many other countries in the Global North, a significant proportion of cases are caused by imported food products or are contracted during foreign travel, so rapid identification of the geographical source of new infections is a requirement for robust public health outbreak investigations. Here, we detail the development and application of a hierarchical machine learning model that rapidly identifies and traces the geographical source of S. Enteritidis infections from whole genome sequencing data. A total of 2313 S. Enteritidis genomes, collected by the UK Health Security Agency (UKHSA) between 2014 and 2019, were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to four continents, 11 sub-regions, and 38 countries (53 classes in total). Classification accuracy was highest at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718, and 0.661, respectively). Several countries commonly visited by UK travelers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provided granular geographical source prediction directly from sequencing reads in <4 min per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology. The results suggest that application to a broader range of pathogens and to other geographically structured problems, such as antimicrobial resistance prediction, is warranted.
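To make the ‘local classifier per node’ idea concrete, the following is a minimal sketch rather than the pipeline used in the study. It trains one binary model per node of a toy continent, sub-region, country hierarchy, predicts top-down by following the highest-scoring child at each level, and scores predictions with a hierarchical F1. Everything specific here is an assumption made for illustration: the three-country hierarchy, the two-dimensional feature vectors, the nearest-centroid stand-in for the per-node learner, the "siblings" choice of negative examples, and the reading of hF1 as the common ancestor-set-based hierarchical F1.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); not the 53 classes of the study.
PARENT = {
    "Europe": "root", "Asia": "root",
    "Southern Europe": "Europe", "Eastern Europe": "Europe",
    "South-East Asia": "Asia",
    "Spain": "Southern Europe", "Poland": "Eastern Europe",
    "Thailand": "South-East Asia",
}

def path_to(label):
    """Nodes from the top level down to `label`, root excluded."""
    path = []
    while label != "root":
        path.append(label)
        label = PARENT[label]
    return path[::-1]

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NodeClassifier:
    """Binary nearest-centroid model: an assumed stand-in for the real learner."""
    def fit(self, pos, neg):
        self.pos = centroid(pos)
        self.neg = centroid(neg) if neg else None  # no negatives for an only child
        return self

    def score(self, x):
        # Positive when x is closer to this node's samples than to its siblings'.
        if self.neg is None:
            return 0.0
        return sq_dist(x, self.neg) - sq_dist(x, self.pos)

def fit_lcpn(X, y):
    """Train one binary classifier per node, with sibling samples as negatives."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    models = {}
    for node, parent in PARENT.items():
        pos = [x for x, lab in zip(X, y) if node in path_to(lab)]
        neg = [x for x, lab in zip(X, y)
               if node not in path_to(lab)
               and (parent == "root" or parent in path_to(lab))]
        models[node] = NodeClassifier().fit(pos, neg)
    return children, models

def predict_path(x, children, models):
    """Top-down prediction: descend into the highest-scoring child until a leaf."""
    node, path = "root", []
    while children.get(node):
        node = max(children[node], key=lambda c: models[c].score(x))
        path.append(node)
    return path

def hierarchical_f1(true_labels, pred_labels):
    """hF1 from the overlap between true and predicted ancestor sets."""
    overlap = n_pred = n_true = 0
    for t, p in zip(true_labels, pred_labels):
        t_set, p_set = set(path_to(t)), set(path_to(p))
        overlap += len(t_set & p_set)
        n_pred += len(p_set)
        n_true += len(t_set)
    hp, hr = overlap / n_pred, overlap / n_true
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0

# Made-up feature vectors standing in for genome-derived features.
X = [[0, 0], [0, 1], [5, 0], [5, 1], [10, 10], [10, 11]]
y = ["Spain", "Spain", "Poland", "Poland", "Thailand", "Thailand"]
children, models = fit_lcpn(X, y)
print(predict_path([0.2, 0.5], children, models))
print(hierarchical_f1(["Spain", "Poland"], ["Spain", "Spain"]))
```

In this sketch, predicting Spain for a Polish isolate still earns partial credit under hF1 because the continent is correct, which is one reason a hierarchical metric can be more informative than flat accuracy for nested geographical labels.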
