INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis

Motivation: Predicting drug resistance and identifying its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving it requires a predictive model that is transparent, accurate, and flexible. The methods currently used for this purpose rarely satisfy all three criteria. On the one hand, approaches that test strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit lower accuracy or lack the flexibility needed to generalize to previously unseen data.

Contribution: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, that simultaneously yields highly accurate predictions, produces interpretable results, and is flexible enough to be optimized for various evaluation metrics.

Results: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that its accuracy is higher than or comparable to that of commonly used machine learning models, and that it is able to identify variants in genes previously reported to be associated with drug resistance. Our method is intrinsically interpretable and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed from the Python Package Index (PyPI) under the name ingotdr. The package is also compatible with most of the tools in the Scikit-learn machine learning library.
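The group-testing idea mentioned in the abstract can be illustrated with a toy example. The sketch below is not the INGOT-DR implementation and does not use the ingotdr package; it only shows, under stated assumptions, what an interpretable OR rule over variants looks like. Each sample is treated as a "pooled test" over the variants it carries, and the goal is a small set of variants such that a sample is called resistant if it carries any of them. The data, the function names (`predict`, `fit_or_rule`), and the `penalty` and `max_size` parameters are all hypothetical, and a plain exhaustive search stands in for whatever optimization the actual method uses.

```python
from itertools import combinations


def predict(rows, selected):
    """OR rule: call a sample resistant (1) if it carries any selected variant."""
    return [int(any(row[j] for j in selected)) for row in rows]


def fit_or_rule(rows, labels, penalty=2.0, max_size=3):
    """Exhaustively search variant subsets of size <= max_size.

    Minimises (number of selected variants) + penalty * (misclassified samples).
    Hypothetical toy objective; only feasible for a handful of variants.
    """
    n_features = len(rows[0])
    best_cost, best_subset = float("inf"), ()
    for k in range(max_size + 1):
        for subset in combinations(range(n_features), k):
            errors = sum(p != y for p, y in zip(predict(rows, subset), labels))
            cost = k + penalty * errors
            if cost < best_cost:  # strict: keep the first subset found at a given cost
                best_cost, best_subset = cost, subset
    return best_subset, best_cost


# Made-up presence/absence matrix: rows are isolates, columns are variants.
rows = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
# Made-up phenotypes: 1 = resistant, 0 = susceptible.
labels = [1, 1, 0, 0, 1, 0]

best_subset, best_cost = fit_or_rule(rows, labels)
print("selected variant columns:", best_subset)
print("predictions:", predict(rows, best_subset))
```

The learned rule is directly readable as "resistant if any of the selected variants is present", which is the sense in which such a classifier is interpretable. Changing the `penalty` term trades rule size against misclassifications, a simplified stand-in for tuning toward a chosen evaluation metric.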
