Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation

Few-shot video segmentation is the task of delineating a specific novel class in a query video using a few labelled support images. Typical approaches compare support and query features but limit the comparison to a single feature layer, thereby ignoring potentially valuable information. We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation that combines information across scales within a transformer decoder. Typical multiscale transformer decoders for segmentation tasks learn a compressed representation, their queries, through information exchange across scales. Unlike previous work, we instead preserve the detailed feature maps during cross-scale information exchange via multiscale memory transformer decoding, which reduces confusion between the background and the novel class. Integral to the approach, we investigate multiple forms of information exchange across scales in different tasks and provide empirical evidence on which form to use in each task. The overall comparisons between query and support features benefit from both rich semantics and precise localization. We demonstrate our approach primarily on few-shot video object segmentation and, in an adapted version, on its fully supervised counterpart. In all cases, our approach outperforms the baseline and yields state-of-the-art performance. Our code is publicly available at https://github.com/MSiam/MMC-MultiscaleMemory.
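To make the idea of comparing query and support features at more than one scale concrete, the following is a minimal NumPy sketch. It is not the MMC transformer decoder: it swaps in a much simpler technique, masked-average prototypes with cosine similarity, and fuses the per-scale similarity maps by nearest-neighbour upsampling and averaging. All function names (`masked_prototype`, `cosine_map`, `upsample_nearest`, `multiscale_compare`) are hypothetical, and the sketch assumes feature lists ordered from fine to coarse with resolutions that divide evenly.

```python
import numpy as np


def masked_prototype(support_feat, support_mask):
    """Average support features (C, H, W) over the foreground mask (H, W)."""
    weights = support_mask.astype(support_feat.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("support mask has no foreground pixels")
    return (support_feat * weights).sum(axis=(1, 2)) / total


def cosine_map(query_feat, prototype, eps=1e-8):
    """Cosine similarity between every query location (C, H, W) and a prototype (C,)."""
    dots = np.einsum("chw,c->hw", query_feat, prototype)
    norms = np.linalg.norm(query_feat, axis=0) * np.linalg.norm(prototype)
    return dots / (norms + eps)


def upsample_nearest(sim, factor):
    """Nearest-neighbour upsampling of an (H, W) map by an integer factor."""
    return np.repeat(np.repeat(sim, factor, axis=0), factor, axis=1)


def multiscale_compare(query_feats, support_feats, support_masks):
    """Fuse per-scale similarity maps at the finest resolution.

    Assumption (illustrative only): the lists are ordered fine to coarse and
    each coarser resolution divides the finest one evenly.
    """
    fine_h = query_feats[0].shape[1]
    maps = []
    for q, s, m in zip(query_feats, support_feats, support_masks):
        sim = cosine_map(q, masked_prototype(s, m))
        maps.append(upsample_nearest(sim, fine_h // q.shape[1]))
    return np.mean(maps, axis=0)
```

Averaging the upsampled maps is the simplest possible fusion; it only illustrates why coarse features (semantics) and fine features (localization) can both contribute to the final comparison, whereas the approach described above exchanges information across scales inside a transformer decoder while keeping the detailed feature maps.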
