Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation

We propose a self-supervised spatio-temporal matching method coined Motion-Aware Mask Propagation (MAMP) for semi-supervised video object segmentation. During training, MAMP leverages the frame reconstruction task to train the model without the need for annotations. During inference, MAMP extracts high-resolution features from each frame to build a memory bank from the features as well as the predicted masks of selected past frames. MAMP then propagates the masks from the memory bank to subsequent frames according to our motion-aware spatio-temporal matching module, also proposed in this paper. Evaluation on DAVIS-2017 and YouTube-VOS datasets show that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods, i.e. 4.9% higher mean J&F on DAVIS-2017 and 4.85% higher mean J&F on the unseen categories of YouTube-VOS than the nearest competitor. Moreover, MAMP performs on par with many supervised video object segmentation methods. Our code is available at: https: //github.com/bo-miao/MAMP.

[1]  Sergio Guadarrama,et al.  Tracking Emerges by Colorizing Videos , 2018, ECCV.

[2]  Wei Liu,et al.  Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Erika Lu,et al.  MAST: A Memory-Augmented Self-Supervised Tracker , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Kalyan Sunkavalli,et al.  Fast Video Object Segmentation by Reference-Guided Mask Propagation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Jianbo Jiao,et al.  Self-supervised Video Representation Learning by Pace Prediction , 2020, ECCV.

[6]  King-Sun Fu,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Publication Information , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Shengping Zhang,et al.  Efficient Regional Memory Network for Video Object Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Zihang Lai,et al.  Self-supervised Learning for Video Correspondence Flow , 2019, ArXiv.

[9]  Aggelos K. Katsaggelos,et al.  Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  In-So Kweon,et al.  Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles , 2018, AAAI.

[11]  Bastian Leibe,et al.  PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation , 2018, ACCV.

[12]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ning Xu,et al.  YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark , 2018, ArXiv.

[14]  Yu Li,et al.  Fast Video Object Segmentation using the Global Context Module , 2020, ECCV.

[15]  Xueting Li,et al.  Joint-task Self-supervised Learning for Temporal Correspondence , 2019, NeurIPS.

[16]  Bastian Leibe,et al.  FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xin Li,et al.  Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement , 2020, NeurIPS.

[18]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Erik Reinhard,et al.  Color Transfer between Images , 2001, IEEE Computer Graphics and Applications.

[20]  Andrew Zisserman,et al.  Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[21]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[22]  Changick Kim,et al.  RPM-Net: Robust Pixel-Level Matching Networks for Self-Supervised Video Object Segmentation , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[24]  Jia Deng,et al.  RAFT: Recurrent All-Pairs Field Transforms for Optical Flow , 2020, ECCV.

[25]  Alexander G. Schwing,et al.  VideoMatch: Matching based Video Object Segmentation , 2018, ECCV.

[26]  Shenghua Gao,et al.  Future Frame Prediction for Anomaly Detection - A New Baseline , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Ning Xu,et al.  Video Object Segmentation Using Space-Time Memory Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Laura Leal-Taixe,et al.  Make One-Shot Video Object Segmentation Efficient Again , 2020, NeurIPS.

[29]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[30]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[31]  Yunchao Wei,et al.  Collaborative Video Object Segmentation by Foreground-Background Integration , 2020, ECCV.

[32]  Steven C. H. Hoi,et al.  Learning Video Object Segmentation From Unlabeled Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Michael Felsberg,et al.  A Generative Appearance Model for End-To-End Video Object Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Allan Jabri,et al.  Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Miriam Bellver,et al.  RVOS: End-To-End Recurrent Network for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Luc Van Gool,et al.  The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[37]  Thomas Brox,et al.  Lucid Data Dreaming for Video Object Segmentation , 2017, International Journal of Computer Vision.

[38]  Luc Van Gool,et al.  Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Wei Liu,et al.  VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ling Shao,et al.  RANet: Ranking Attention Network for Fast Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Euntai Kim,et al.  Kernelized Memory Network for Video Object Segmentation , 2020, ECCV.

[43]  Chi-Keung Tang,et al.  Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Xiaoxiao Li,et al.  Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation , 2018, ECCV.

[45]  Toshihiko Yamasaki,et al.  Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework , 2020, ACM Multimedia.

[46]  K.-K. Maninis,et al.  Video Object Segmentation without Temporal Information , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.