DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

Automatic medical image segmentation has benefited greatly from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which struggle to model long-range dependencies and global context because of the limited receptive field of the convolution operation. Inspired by the success of the Transformer, whose self-attention mechanism is powerful at modeling long-range contextual information, researchers have devoted considerable effort to designing robust Transformer-based variants of U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose in this paper a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which, to our knowledge, is the first attempt to incorporate the advantages of the hierarchical Swin Transformer into both the encoder and the decoder of the standard U-shaped architecture to improve semantic segmentation quality across diverse medical images. Unlike many prior Transformer-based solutions, DS-TransUNet first adopts dual-scale encoder sub-networks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales. As the core component of DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to establish global dependencies between features of different scales through the self-attention mechanism, so as to make full use of the resulting multi-scale feature representations. Furthermore, we introduce the Swin Transformer block into the decoder to further exploit long-range contextual information during up-sampling. Extensive experiments on four typical medical image segmentation tasks demonstrate the effectiveness of DS-TransUNet and show that our approach significantly outperforms state-of-the-art methods.
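
The abstract gives no implementation details, but the following minimal PyTorch-style sketch illustrates one plausible reading of the cross-scale fusion idea behind the TIF module: token sequences from the coarse and fine Swin branches each contribute a pooled global token to the other branch, and standard multi-head self-attention then mixes the cross-scale information. All names and design choices here (TIFBlock, mean-pooled global tokens, the residual connections) are hypothetical assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TIFBlock(nn.Module):
    """Hypothetical sketch of a Transformer Interactive Fusion (TIF) module.

    Each branch's (B, N, C) token sequence is summarized into a single global
    token, that token is prepended to the *other* branch's sequence, and a
    standard multi-head self-attention block mixes the cross-scale information.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_coarse = nn.LayerNorm(dim)
        self.norm_fine = nn.LayerNorm(dim)
        self.attn_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def _global_token(tokens: torch.Tensor) -> torch.Tensor:
        # Summarize a (B, N, C) sequence into one (B, 1, C) token by mean pooling.
        return tokens.mean(dim=1, keepdim=True)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor):
        # coarse: (B, Nc, C) tokens from the coarse-patch branch
        # fine:   (B, Nf, C) tokens from the fine-patch branch
        g_coarse = self._global_token(coarse)
        g_fine = self._global_token(fine)

        # Each branch attends over its own tokens plus the other branch's global token.
        coarse_in = self.norm_coarse(torch.cat([g_fine, coarse], dim=1))
        fine_in = self.norm_fine(torch.cat([g_coarse, fine], dim=1))

        coarse_out, _ = self.attn_coarse(coarse_in, coarse_in, coarse_in)
        fine_out, _ = self.attn_fine(fine_in, fine_in, fine_in)

        # Drop the injected global token and keep residual connections.
        return coarse + coarse_out[:, 1:], fine + fine_out[:, 1:]


if __name__ == "__main__":
    tif = TIFBlock(dim=96)
    coarse = torch.randn(2, 49, 96)   # e.g. 7x7 coarse-scale tokens
    fine = torch.randn(2, 196, 96)    # e.g. 14x14 fine-scale tokens
    out_c, out_f = tif(coarse, fine)
    print(out_c.shape, out_f.shape)   # (2, 49, 96) and (2, 196, 96)
```

In a full dual-branch encoder, a block like this would be applied at each resolution stage before the fused features are passed through the skip connections to the Swin Transformer decoder blocks described above.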
