SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained model debugging and analysis

For the deployment of artificial intelligence (AI) in high-risk settings such as healthcare, methods that provide interpretability/explainability or allow fine-grained error analysis are critical. Many recent methods for interpretability/explainability and fine-grained error analysis use concepts, which are meta-labels that are semantically meaningful to humans. However, only a few datasets include concept-level meta-labels, and most of these meta-labels describe natural images that do not require domain expertise to annotate. Existing densely annotated datasets in medicine have focused on meta-labels relevant to a single disease, such as melanoma. In dermatology, skin disease is described using an established clinical lexicon that allows clinicians to communicate physical exam findings to one another. To provide a medical dataset densely annotated by domain experts, with annotations useful across multiple disease processes, we developed SkinCon: a skin disease dataset densely annotated by dermatologists. SkinCon includes 3230 images from the Fitzpatrick 17k dataset densely annotated with 48 clinical concepts, 22 of which have at least 50 images representing the concept. The concepts were chosen by two dermatologists based on the clinical descriptor terms used to describe skin lesions. Examples include "plaque", "scale", and "erosion". The same concepts were also used to label 656 skin disease images from the Diverse Dermatology Images dataset, providing an additional external dataset with diverse skin tone representation. We review the potential applications of the SkinCon dataset, such as probing models, concept-based explanations, and concept bottlenecks. Furthermore, we use SkinCon to demonstrate two of these use cases: debugging mistakes of an existing dermatology AI model with concepts, and developing interpretable models with post-hoc concept bottleneck models.
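The abstract's distinction between all 48 concepts and the 22 with at least 50 images amounts to a minimum-support filter over the per-image annotations. The sketch below illustrates that filter only; it is not the authors' code. The function name `concepts_with_min_support`, the dictionary layout (image ID mapped to a list of concept names), and the toy image IDs are all assumptions made for illustration, and the released SkinCon files may be organised differently.

```python
from collections import Counter


def concepts_with_min_support(annotations, min_images=50):
    """Return the concepts labelled on at least `min_images` images.

    `annotations` is a hypothetical mapping of image ID -> list of concept
    names; this layout is an assumption, not the dataset's actual format.
    """
    counts = Counter()
    for concepts in annotations.values():
        # Use a set so a concept repeated on one image is counted once.
        counts.update(set(concepts))
    return sorted(name for name, n in counts.items() if n >= min_images)


# Toy annotations using the three example concepts named in the abstract.
toy = {
    "img_001": ["plaque", "scale"],
    "img_002": ["plaque"],
    "img_003": ["erosion", "plaque"],
}

# With a threshold of 2, only "plaque" (present on all three images) remains.
print(concepts_with_min_support(toy, min_images=2))
```

On the real annotations, the default threshold of 50 would be the setting that corresponds to the 22-concept subset described above, provided the data is first loaded into the assumed layout.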
