Interpretable Classification of Bacterial Raman Spectra With Knockoff Wavelets

Deep neural networks and other machine learning models are widely applied to biomedical signal data because they can detect complex patterns and compute accurate predictions. However, the difficulty of interpreting such models is a limitation, especially for applications involving high-stakes decisions, including the identification of bacterial infections. This paper considers fast Raman spectroscopy data and demonstrates that a logistic regression model with carefully selected features achieves accuracy comparable to that of neural networks, while being much simpler and more transparent. Our analysis leverages wavelet features with intuitive chemical interpretations, and performs controlled variable selection with knockoffs to ensure the predictors are relevant and non-redundant. Although we focus on a particular data set, the proposed approach is broadly applicable to other types of signal data for which interpretability may be important.
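The pipeline described above — extract features from each spectrum, then run the knockoff filter with a sparse logistic regression to select a small, non-redundant predictor set at a controlled false discovery rate — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses synthetic i.i.d. Gaussian features as stand-ins for wavelet coefficients (for independent standard Gaussian features, a fresh i.i.d. draw is a valid model-X knockoff copy), and the signal strength, penalty level, and FDR target `q` are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, k = 500, 50, 10  # spectra, features, true signals (demo values)

# Hypothetical stand-in for wavelet coefficients of each spectrum:
# i.i.d. standard Gaussian features, with the first k driving the label.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 1.5
y = (X @ beta + rng.standard_normal(n) > 0).astype(int)

# Model-X knockoffs: because these features are independent N(0,1),
# an independent fresh draw is exchangeable with X and valid as knockoffs.
# (Real spectra would need a knockoff construction matching their covariance.)
X_knock = rng.standard_normal((n, p))

# Feature statistics W_j: difference of absolute L1-penalized logistic
# regression coefficients between each feature and its knockoff copy.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
fit.fit(np.hstack([X, X_knock]), y)
coef = np.abs(fit.coef_.ravel())
W = coef[:p] - coef[p:]

# Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / #{W_j >= t} <= q.
q = 0.2  # target false discovery rate
tau = np.inf
for t in np.sort(np.abs(W[W != 0])):
    fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
    if fdp_hat <= q:
        tau = t
        break
selected = np.where(W >= tau)[0]
print("selected feature indices:", selected)
```

In the paper's setting, each selected index would map back to a wavelet coefficient at a specific scale and wavenumber location, which is what gives the chosen predictors their chemical interpretation.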
