Kernelized Memory Network for Video Object Segmentation

Semi-supervised video object segmentation (VOS) is the task of segmenting a target object throughout a video, given the ground-truth segmentation mask of that object in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising solution for semi-supervised VOS. However, an important point is overlooked when applying STM to VOS: the solution (STM) is non-local, whereas the problem (VOS) is predominantly local. To resolve this mismatch, we propose a kernelized memory network (KMN). As in previous works, our KMN is pre-trained on static images before being trained on real videos. Unlike previous works, we use the Hide-and-Seek strategy during pre-training to obtain better results in handling occlusions and extracting segment boundaries. The proposed KMN surpasses the state of the art on standard benchmarks by a significant margin (+5% on the DAVIS 2017 test-dev set). In addition, the runtime of KMN is 0.12 seconds per frame on the DAVIS 2016 validation set, and KMN requires little extra computation compared with STM.
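As a rough illustration of the Hide-and-Seek idea mentioned above, the following sketch hides randomly chosen grid cells of a training image so that a network cannot rely on any single region. It is a minimal NumPy sketch, not the authors' implementation: the function name `hide_patches`, the grid size, the hiding probability, and the fill value are all assumptions chosen for illustration, and the abstract does not state how KMN configures or applies the strategy.

```python
import numpy as np


def hide_patches(image, grid_size=4, hide_prob=0.5, fill_value=0.0, rng=None):
    """Hide random grid cells of an (H, W, C) image.

    Hypothetical helper for illustration only. Returns the augmented copy
    and a boolean (H, W) mask that is True where pixels remain visible.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    out = image.copy()
    visible = np.ones((h, w), dtype=bool)

    # Cell boundaries; linspace copes with sizes not divisible by grid_size.
    ys = np.linspace(0, h, grid_size + 1).astype(int)
    xs = np.linspace(0, w, grid_size + 1).astype(int)

    for i in range(grid_size):
        for j in range(grid_size):
            # Each cell is hidden independently with probability hide_prob.
            if rng.random() < hide_prob:
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill_value
                visible[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = False
    return out, visible


# Usage: augment one synthetic 8x8 RGB image with a fixed seed.
image = np.ones((8, 8, 3), dtype=np.float32)
augmented, visible = hide_patches(image, rng=np.random.default_rng(0))
```

Returning the visibility mask alongside the image is a design choice of this sketch; it makes it easy to check which pixels were hidden, while the ground-truth segmentation mask would stay unchanged so that the network must still predict the occluded parts of the object.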
