Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions

ABSTRACT Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. The variants selected by our joint modeling approach overlap substantially with those identified by previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold.

IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. Models of phenotype-genotype association built from such data could in the future be used for rapid prediction of clinically important phenotypes, such as antibiotic resistance and virulence, in rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances both association analysis and the generation of portable prediction models.
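To make the idea of elastic net fitting over unitig presence/absence patterns more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a continuous phenotype with squared-error loss and the common parametrization of the penalty, lam * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2). A binary phenotype such as resistant/susceptible would normally use a logistic loss instead. The unitig sequences, the toy data, and all function and parameter names (`fit_elastic_net`, `lam`, `alpha`, `n_sweeps`) are hypothetical and chosen only for illustration.

```python
"""Toy elastic net over unitig presence/absence (illustrative sketch only)."""


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def center(values):
    """Return the mean and the mean-centered copy of a list of numbers."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]


def fit_elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Cyclic coordinate descent for squared-error loss plus elastic net penalty.

    X: rows are samples, columns are unitig presence (1) / absence (0).
    y: one phenotype value per sample.
    Returns (intercept, list of per-unitig coefficients).
    """
    n, p = len(X), len(X[0])
    y_mean, resid = center(y)  # residual starts as centered phenotype (beta = 0)
    col_means, cols = [], []
    for j in range(p):
        mean_j, col_j = center([row[j] for row in X])
        col_means.append(mean_j)
        cols.append(col_j)
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(c * c for c in cols[j]) / n
            if z == 0.0:  # unitig present in all or no samples: uninformative
                continue
            # Covariance of unitig j with the residual that excludes its own effect.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(cols[j], resid)) / n
            new = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta:
                resid = [r - c * delta for c, r in zip(cols[j], resid)]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta


def predict(intercept, beta, pattern):
    """Predicted phenotype for one sample's presence/absence pattern."""
    return intercept + sum(b * x for b, x in zip(beta, pattern))


# Hypothetical unitigs and toy data: the first unitig tracks the phenotype.
unitigs = ["ACGTTGCA", "GGCATTAC", "TTAGCCGA"]
X = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

intercept, beta = fit_elastic_net(X, y, lam=0.01, alpha=0.5)
selected = [u for u, b in zip(unitigs, beta) if b != 0.0]
print("selected unitigs:", selected)
print("coefficients:", beta)
```

In this toy data only the first unitig receives a nonzero weight, and that weight is shrunk slightly below 1 by the penalty. The sparsity comes from the soft-thresholding step, which is what lets a single joint fit both select variants and yield a compact, portable predictor. Raising `lam` or moving `alpha` toward 1 selects fewer unitigs.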
