Fast Video Object Segmentation via Mask Transfer Network

Accuracy and processing speed are two important factors that affect the use of video object segmentation (VOS) in real applications. With the advanced techniques of deep neural networks, the accuracy has been significantly improved, however, the speed is still far below the real-time needs because of the complicated network design, such as the requirement of the first frame fine-tuning step. To overcome this limitation, we propose a novel mask transfer network (MTN), which can greatly boost the processing speed of VOS and also achieve a reasonable accuracy. The basic idea of MTN is to transfer the reference mask to the target frame via an efficient global pixel matching strategy. The global pixel matching between the reference frame and the target frame is to ensure good matching results. To enhance the matching speed, we perform the matching on a downsampled feature map with 1/32 of the original frame size. At the same time, to preserve the detailed mask information in such a small feature map, a mask network is designed to encode the annotated mask information with 512 channels. Finally, an efficient feature warping method is used to transfer the encoded reference mask to the target frame. Based on this design, our method avoids the fine-tuning step on the first frame and does not rely on the temporal cues and particular object categories. Therefore, it runs very fast and can be conveniently trained only with images, as well as being robust to unseen objects. Experiments on the DAVIS datasets demonstrate that MTN can achieve a speed of 37 fps, and also shows a competitive accuracy in comparison to the state-of-the-art methods.

[1]  Ming-Hsuan Yang,et al.  Fast and Accurate Online Video Object Segmentation via Tracking Parts , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Peter V. Gehler,et al.  Video Propagation Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Thomas Brox,et al.  Lucid Data Dreaming for Multiple Object Tracking , 2017, 1703.09554.

[4]  Seunghoon Hong,et al.  Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Chang-Su Kim,et al.  Online Video Object Segmentation via Convolutional Trident Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Alexander Sorkine-Hornung,et al.  Bilateral Space Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Alexander G. Schwing,et al.  VideoMatch: Matching based Video Object Segmentation , 2018, ECCV.

[8]  Gregory R. Koch,et al.  Siamese Neural Networks for One-Shot Image Recognition , 2015 .

[9]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Ming-Hsuan Yang,et al.  Unseen Object Segmentation in Videos via Transferable Representations , 2018, ACCV.

[11]  Qin Huang,et al.  Instance Embedding Transfer to Unsupervised Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[13]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[15]  Kalyan Sunkavalli,et al.  Fast Video Object Segmentation by Reference-Guided Mask Propagation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Shi-Min Hu,et al.  Global contrast based salient region detection , 2011, CVPR 2011.

[19]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Markus H. Gross,et al.  Fully Connected Object Proposals for Video Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jun-Sik Kim,et al.  Pixel-Level Matching for Video Object Segmentation Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[26]  Luc Van Gool,et al.  Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Chang-Su Kim,et al.  Primary Object Segmentation in Videos Based on Region Augmentation and Reduction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Seyed-Ahmad Ahmadi,et al.  V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[30]  Michael J. Black,et al.  Video Segmentation via Object Flow , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jan Kautz,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Aggelos K. Katsaggelos,et al.  Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Michal Irani,et al.  Video Segmentation by Non-Local Consensus voting , 2014, BMVC.

[35]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[36]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[37]  Dani Lischinski,et al.  JumpCut , 2015, ACM Trans. Graph..

[38]  Xiaoxiao Li,et al.  Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation , 2018, ECCV.