UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation

Transformer-based models, which are better able to learn global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. A transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and its loss can lead to sub-optimal performance when dealing with large numbers of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are neither robust nor efficient for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. To address these challenges, and inspired by the nested hierarchical structures in vision transformers, we propose a novel 3D medical image segmentation method (UNesT) that employs a simplified and faster-converging transformer encoder design, which achieves local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets spanning multiple modalities, anatomies, and a wide range of tissue classes, including 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidneys, and inter-connected kidney tumors and brain tumors. We show that UNesT consistently achieves state-of-the-art performance, and we evaluate its generalizability and data efficiency. In particular, the model performs whole-brain segmentation of the complete ROI with 133 tissue classes in a single network, outperforming the prior state-of-the-art method SLANT27, which ensembles 27 networks.
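The idea of grouping spatially adjacent patches into blocks and aggregating them hierarchically can be illustrated with a minimal NumPy sketch. This is not the UNesT implementation: the function names (`blockify`, `unblockify`, `aggregate`, `hierarchy_shapes`), the 8x8x8 token grid, the block size of 2, and the use of average pooling in place of a learned aggregation layer are all assumptions made for illustration. The sketch only tracks how tokens are partitioned at each level; a real encoder would apply self-attention within each block before aggregating.

```python
import numpy as np


def blockify(tokens, block):
    """Group a (D, H, W, C) token grid into non-overlapping cubic blocks.

    Returns an array of shape (num_blocks, block**3, C), so each block holds
    only spatially adjacent tokens.
    """
    d, h, w, c = tokens.shape
    assert d % block == 0 and h % block == 0 and w % block == 0
    gd, gh, gw = d // block, h // block, w // block
    x = tokens.reshape(gd, block, gh, block, gw, block, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, then in-block
    return x.reshape(gd * gh * gw, block ** 3, c)


def unblockify(blocks, grid, block):
    """Inverse of blockify: restore the (D, H, W, C) token grid."""
    gd, gh, gw = grid
    c = blocks.shape[-1]
    x = blocks.reshape(gd, gh, gw, block, block, block, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(gd * block, gh * block, gw * block, c)


def aggregate(tokens):
    """Halve each spatial dimension by 2x2x2 average pooling.

    Assumption: a stand-in for a learned aggregation layer.
    """
    d, h, w, c = tokens.shape
    x = tokens.reshape(d // 2, 2, h // 2, 2, w // 2, 2, c)
    return x.mean(axis=(1, 3, 5))


def hierarchy_shapes(tokens, block):
    """Record the block layout at each level until a single block remains."""
    shapes = []
    while True:
        blocks = blockify(tokens, block)
        shapes.append(blocks.shape)
        # A real encoder would run self-attention inside each block here.
        if blocks.shape[0] == 1:
            break
        tokens = aggregate(tokens)
    return shapes


if __name__ == "__main__":
    grid = np.random.rand(8, 8, 8, 3)  # hypothetical 8x8x8 grid of 3-dim tokens
    for level, shape in enumerate(hierarchy_shapes(grid, block=2)):
        print(f"level {level}: (num_blocks, tokens_per_block, channels) = {shape}")
```

Under these assumptions the number of blocks shrinks by a factor of eight per level while the number of tokens per block stays fixed, so communication that starts local to each block covers a progressively larger spatial extent.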
