Medical Transformer: Gated Axial-Attention for Medical Image Segmentation

Over the past decade, deep convolutional neural networks have been widely adopted for medical image segmentation and shown to achieve adequate performance. However, due to the inductive biases inherent in convolutional architectures, they lack an understanding of long-range dependencies in the image. Recently proposed Transformer-based architectures, which leverage the self-attention mechanism, encode long-range dependencies and learn highly expressive representations. This motivates us to study the feasibility of using Transformer-based network architectures for medical image segmentation tasks. The majority of existing Transformer-based architectures proposed for vision applications require large-scale datasets to train properly. Compared to datasets for general vision applications, however, medical imaging datasets contain relatively few samples, which makes it difficult to train Transformers efficiently for medical applications. To this end, we propose a Gated Axial-Attention model, which extends existing architectures by introducing an additional control mechanism in the self-attention module. Furthermore, to train the model effectively on medical images, we propose a Local-Global training strategy (LoGo), which further improves performance. Specifically, we operate on the whole image and on patches to learn global and local features, respectively. The proposed Medical Transformer (MedT) is evaluated on three different medical image segmentation datasets and is shown to achieve better performance than convolutional and other related Transformer-based architectures. Code: https://github.com/jeya-mariajose/Medical-Transformer
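The Local-Global idea of operating on the whole image and on patches can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 4 x 4 patch grid, and the fusion of the two branches by element-wise addition are all assumptions made for illustration, and the two branches are passed in as arbitrary callables rather than attention networks. Plain Python lists are used so the sketch has no dependencies.

```python
from typing import Callable, List

Image = List[List[float]]
Branch = Callable[[Image], Image]


def split_into_patches(image: Image, grid: int) -> List[List[Image]]:
    """Split a square image into a grid x grid array of equal square patches."""
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("expected a square image whose side is divisible by grid")
    p = size // grid
    return [
        [[row[c * p:(c + 1) * p] for row in image[r * p:(r + 1) * p]]
         for c in range(grid)]
        for r in range(grid)
    ]


def merge_patches(patches: List[List[Image]]) -> Image:
    """Reassemble a grid of patches into one image using their grid positions."""
    merged: Image = []
    for patch_row in patches:
        patch_height = len(patch_row[0])
        for i in range(patch_height):
            # Concatenate row i of every patch in this grid row.
            merged.append([v for patch in patch_row for v in patch[i]])
    return merged


def local_global_forward(image: Image, global_branch: Branch,
                         local_branch: Branch, grid: int = 4) -> Image:
    """Run a global branch on the full image and a local branch on each patch.

    Hypothetical helper: both branches must return an output with the same
    shape as their input. Fusion by addition is an assumption of this sketch.
    """
    global_out = global_branch(image)
    patches = split_into_patches(image, grid)
    local_out = merge_patches(
        [[local_branch(patch) for patch in patch_row] for patch_row in patches]
    )
    return [
        [g + l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_out, local_out)
    ]
```

With identity functions as both branches, `local_global_forward` simply returns the input added to itself, which is a quick way to check that the patches are split and reassembled in the right positions.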
