Reliable and Trustworthy Machine Learning for Health Using Dataset Shift Detection

Unpredictable ML model behavior on unseen data, especially in the health domain, raises serious safety concerns because the repercussions of mistakes can be fatal. In this paper, we explore the feasibility of using state-of-the-art out-of-distribution detectors for reliable and trustworthy diagnostic predictions. We select publicly available deep learning models relating to various health conditions (e.g., skin cancer, lung sound, and Parkinson’s disease) using various input data types (e.g., image, audio, and motion data). We demonstrate that these models show unreasonable predictions on out-of-distribution datasets. We show that Mahalanobis distance- and Gram matrices-based out-of-distribution detection methods are able to detect out-of-distribution data with high accuracy for the health models that operate on different modalities. We then translate the out-of-distribution score into a human-interpretable CONFIDENCE SCORE to investigate its effect on users’ interaction with health ML applications. Our user study shows that the CONFIDENCE SCORE helped the participants trust only the results with a high score when making a medical decision and disregard results with a low score. Through this work, we demonstrate that dataset shift is a critical piece of information for high-stakes ML applications, such as medical diagnosis and healthcare, to provide reliable and trustworthy predictions to the users.
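To make the Mahalanobis-distance idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a common formulation in which class-conditional Gaussians with a shared covariance are fitted to feature vectors, and the out-of-distribution score is the squared distance to the nearest class mean. The synthetic 2-D features, the helper names, and the percentile-based mapping to a 0-100 confidence value are all hypothetical choices made for illustration; the abstract does not specify how the score is translated.

```python
import numpy as np

def fit_gaussians(features, labels):
    """Estimate per-class means and a shared (tied) covariance matrix."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    # Small ridge term keeps the covariance invertible (an assumption).
    precision = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[1]))
    return means, precision

def mahalanobis_score(x, means, precision):
    """Squared Mahalanobis distance to the closest class mean (higher = more OOD)."""
    diffs = x[:, None, :] - means[None, :, :]  # shape: (n, classes, dim)
    d2 = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)
    return d2.min(axis=1)

def confidence_score(scores, reference_scores):
    """Hypothetical mapping: percentage of in-distribution reference
    scores that are at least as large as the given score."""
    ref = np.sort(reference_scores)
    rank = np.searchsorted(ref, scores, side="left")
    return 100.0 * (1.0 - rank / len(ref))

def sample_in_distribution(n, rng):
    """Synthetic stand-in for penultimate-layer features of a two-class model."""
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    labels = rng.integers(0, 2, size=n)
    return centers[labels] + rng.normal(size=(n, 2)), labels

rng = np.random.default_rng(0)
train_x, train_y = sample_in_distribution(400, rng)
val_x, _ = sample_in_distribution(200, rng)
test_x, _ = sample_in_distribution(200, rng)
ood_x = np.array([20.0, -20.0]) + rng.normal(size=(200, 2))  # shifted data

means, precision = fit_gaussians(train_x, train_y)
reference = mahalanobis_score(val_x, means, precision)
conf_in = confidence_score(mahalanobis_score(test_x, means, precision), reference)
conf_ood = confidence_score(mahalanobis_score(ood_x, means, precision), reference)
print("mean confidence, in-distribution:", conf_in.mean())
print("mean confidence, shifted data:", conf_ood.mean())
```

In this toy setting the shifted inputs receive far lower confidence than held-out in-distribution inputs, which is the behavior a user-facing score would need in order to flag predictions that should not be trusted.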
