Visual Transformers and Convolutional Neural Networks for Disease Classification on Radiographs: A Comparison of Performance, Sample Efficiency, and Hidden Stratification.

Purpose: To compare the performance, sample efficiency, and hidden stratification of visual transformer (ViT) and convolutional neural network (CNN) architectures for diagnosis of disease on chest radiographs and extremity radiographs using transfer learning.

Materials and Methods: In this HIPAA-compliant retrospective study, the authors fine-tuned data-efficient image transformer (DeiT) ViT and CNN classification models pretrained on ImageNet, using the National Institutes of Health Chest X-ray 14 dataset (112 120 images) for thoracic disease and the MURA dataset (14 656 images) for extremity abnormalities. Performance was assessed on internal test sets and on 75 000 external chest radiographs (three datasets). The primary comparison was DeiT-B ViT vs DenseNet121 CNN; secondary comparisons included DeiT-Ti (Tiny), ResNet152, and EfficientNetB7. Sample efficiency was evaluated by training models on varying dataset sizes. Hidden stratification was evaluated by comparing the prevalence of chest tubes among pneumothorax false-positive and false-negative predictions and the prevalence of specific abnormalities among MURA false-negative predictions.

Results: The DeiT-B weighted area under the receiver operating characteristic curve (wAUC) was slightly lower than that of DenseNet121 on the chest radiograph (0.78 vs 0.79; P < .001) and extremity (0.887 vs 0.893; P < .001) internal test sets and on the chest radiograph external test sets (P < .001 for each). DeiT-B and DeiT-Ti both performed slightly worse than all CNNs on the chest radiograph and extremity tasks. DeiT-B and DenseNet121 showed similar sample efficiency. DeiT-B had lower chest tube prevalence in false-positive predictions than DenseNet121 (43.1% [324 of 5088] vs 47.9% [2290 of 4782]).

Conclusion: Although DeiT models had lower wAUCs than CNNs in the chest radiograph and extremity domains, the differences may be negligible in clinical practice. DeiT-B had sample efficiency similar to that of DenseNet121 and may be less susceptible to certain types of hidden stratification.

Keywords: Computer-aided Diagnosis, Informatics, Neural Networks, Thorax, Skeletal-Appendicular, Convolutional Neural Network (CNN), Feature Detection, Supervised Learning, Machine Learning, Deep Learning

Supplemental material is available for this article.

© RSNA, 2022.
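
The abstract reports a weighted AUC (wAUC) but does not state how the per-class AUCs are weighted. The sketch below shows one common convention, assumed here for illustration only: each class's AUC is weighted by its number of positive examples. The function names (`auc`, `weighted_auc`) and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def weighted_auc(
    labels_by_class: Dict[str, Sequence[int]],
    scores_by_class: Dict[str, Sequence[float]],
) -> float:
    """Average the per-class AUCs, weighting each class by its positive count.

    ASSUMPTION: positive-count weighting; the abstract does not specify the scheme.
    """
    total = 0.0
    weight_sum = 0
    for name, labels in labels_by_class.items():
        n_pos = sum(labels)
        total += n_pos * auc(labels, scores_by_class[name])
        weight_sum += n_pos
    return total / weight_sum


# Toy multilabel data (made up): two findings, four images each.
toy_labels = {"pneumothorax": [1, 0, 0, 1], "effusion": [1, 1, 1, 0]}
toy_scores = {"pneumothorax": [0.9, 0.2, 0.6, 0.5], "effusion": [0.8, 0.4, 0.7, 0.5]}
print(round(weighted_auc(toy_labels, toy_scores), 3))
```

With this weighting, common findings influence the summary score more than rare ones, so a small wAUC gap between two models can conceal larger gaps on rare classes.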