Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection

Background: Several studies have demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences. The prediction process usually amounts to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations within these genes, previously identified from a training panel of strains. We address the problem from the supervised statistical learning perspective, without relying on prior information about such resistance factors. We use a k-mer based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we use the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–73, 2010), which consists of fitting logistic regression models with a Lasso penalty, coupled with extensive resampling procedures.

Results: Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to the state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, and are therefore remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies).

Conclusion: Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.
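The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary k-mer presence/absence matrix, a single fixed Lasso penalty `lam` (rather than a full regularization path), subsamples of half the strains, and a plain proximal-gradient solver. The function names, the toy data, and the 0.6 selection threshold are all hypothetical choices made for illustration.

```python
import math
import random


def fit_l1_logistic(X, y, lam, n_iter=200):
    """Lasso-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, p = len(X), len(X[0])
    # Step size 1/L, using the bound L <= (1 + max squared row norm) / 4.
    lr = 4.0 / (1.0 + max(sum(v * v for v in xi) for xi in X))
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= lr * grad_b  # the intercept is not penalized
        for j in range(p):
            u = w[j] - lr * grad_w[j]
            # Soft-thresholding: the proximal step for the Lasso penalty.
            w[j] = math.copysign(max(abs(u) - lr * lam, 0.0), u)
    return w, b


def stability_selection(X, y, lam, n_subsamples=50, threshold=0.6, seed=0):
    """Return per-feature selection frequencies and the indices kept."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample half of the strains
        w, _b = fit_l1_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return freqs, selected


# Toy data (assumption): 40 strains; column 0 is a k-mer present exactly in
# resistant strains, columns 1-4 are k-mers unrelated to the phenotype.
rng = random.Random(1)
y = [float(i % 2) for i in range(40)]
X = [[yi] + [float(rng.random() < 0.5) for _ in range(4)] for yi in y]

freqs, selected = stability_selection(X, y, lam=0.05)
print("selection frequencies:", freqs)
print("selected k-mer columns:", selected)
```

Only k-mers that receive a nonzero coefficient in a large fraction of the subsampled fits are retained, which is what keeps the final model small. A fuller treatment would compute the selection frequencies along a range of penalty values instead of a single `lam`.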
