Going Deeper Into Embedding Learning for Video Object Segmentation

In this paper, we investigate the principles of consistent training, between given reference and predicted sequence, for better embedding learning of semi-supervised video object segmentation. To accurately segment the target objects given the mask at the first frame, we realize that the expected feature embeddings of any consecutive frames should satisfy the following properties: 1) global consistency in terms of both foreground object(s) and background; 2) robust local consistency under a various object moving rate; 3) environment consistency between the training and inference process; 4) receptive consistency between the receptive fields of network and the variable scales of objects; 5) sampling consistency between foreground and background pixels to avoid training bias. With the principles in mind, we carefully design a simple pipeline to lift both accuracy and efficiency for video object segmentation effectively. With the ResNet-101 as the backbone, our single model achieves a J&F score of 81.0% on the validation set of Youtube-VOS benchmark without any bells and whistles. By applying multi-scale & flip augmentation at the testing stage, the accuracy can be further boosted to 82.4%. Code will be made available.

[1]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[2]  Ning Xu,et al.  YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark , 2018, ArXiv.

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yunchao Wei,et al.  Dual Embedding Learning for Video Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[5]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[6]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Bastian Leibe,et al.  FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Bastian Leibe,et al.  PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation , 2018, ACCV.