Video Object Segmentation with Referring Expressions

Most semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first video frame. However, obtaining a detailed mask is expensive and time-consuming. In this work we explore a more practical and natural way of identifying a target object by employing language referring expressions. Leveraging recent advances of language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks, \(\text {DAVIS}_{\text {16}}\) and \(\text {DAVIS}_{\text {17}}\), with language descriptions of target objects. We show that our approach performs on par with the methods which have access to the object mask on \(\text {DAVIS}_{\text {16}}\) and is competitive to methods using scribbles on challenging \(\text {DAVIS}_{\text {17}}\).

[1]  Shi-Min Hu,et al.  Global contrast based salient region detection , 2011, CVPR 2011.

[2]  Bernt Schiele,et al.  Video Object Segmentation with Language Referring Expressions , 2018, ACCV.

[3]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[8]  Luc Van Gool,et al.  The 2018 DAVIS Challenge on Video Object Segmentation , 2018, ArXiv.

[9]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[10]  Luc Van Gool,et al.  Deep Extreme Cut: From Extreme Points to Object Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).