Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation

Precise localization of polyps is crucial for early cancer screening in gastrointestinal endoscopy. Endoscopic videos provide richer contextual information than still images, but they also pose more challenges. The camera-moving situation, in contrast to the common situation of a fixed camera and moving objects, leads to significant background variation between frames. Severe internal artifacts (e.g. water flow in the human body, specular reflection by tissues) can make the quality of adjacent frames vary considerably. These factors hinder a video-based model from effectively aggregating features from neighboring frames and giving better predictions. In this paper, we present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues. Spatially, STFT mitigates inter-frame variations in the camera-moving situation by aligning features with proposal-guided deformable convolutions. Temporally, STFT uses a channel-aware attention module to simultaneously estimate the quality and correlation of adjacent frames for adaptive feature aggregation. Empirical studies demonstrate the effectiveness and stability of our method. For example, STFT improves the still-image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task on the CVC-Clinic and ASU-Mayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively. Code is available at https://github.com/lingyunwu14/STFT.
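
To make the idea of channel-wise adaptive aggregation concrete, the following is a minimal NumPy sketch and not the STFT implementation. It assumes the support-frame features have already been spatially aligned to the target frame, and it scores each support frame per channel by cosine similarity with the target, which is an illustrative choice. The function name `channel_aware_aggregate` and the tensor shapes are hypothetical.

```python
import numpy as np


def channel_aware_aggregate(target, supports, eps=1e-8):
    """Aggregate support-frame features into one map, channel by channel.

    target:   (C, N) features of the target frame (C channels, N positions).
    supports: (T, C, N) features of T support frames, assumed already aligned.
    Returns the aggregated (C, N) features and the (T, C) attention weights.
    """
    # L2-normalize each channel over its spatial positions.
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    s = supports / (np.linalg.norm(supports, axis=-1, keepdims=True) + eps)

    # Per-channel cosine similarity between each support frame and the target.
    sim = (s * t[None]).sum(axis=-1)                 # shape (T, C)

    # Softmax over the frame axis, so each channel's weights sum to 1.
    sim = sim - sim.max(axis=0, keepdims=True)       # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Weighted sum of support features, with one weight per (frame, channel).
    aggregated = (weights[..., None] * supports).sum(axis=0)
    return aggregated, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 16))
    # One support frame that matches the target and one that contradicts it.
    supports = np.stack([target, -target])
    aggregated, weights = channel_aware_aggregate(target, supports)
    print(aggregated.shape, weights.shape)
```

In this sketch a support frame whose channel response disagrees with the target receives a lower weight for that channel only, which is one simple way to let frame quality and frame correlation jointly shape the aggregation.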
