Video Object Segmentation with Language Referring Expressions

Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations. Leveraging recent advances of language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our method we augment the popular video object segmentation benchmarks, DAVIS'16 and DAVIS'17 with language descriptions of target objects. We show that our language-supervised approach performs on par with the methods which have access to a pixel-level mask of the target object on DAVIS'16 and is competitive to methods using scribbles on the challenging DAVIS'17 dataset.

[1]  Ming-Hsuan Yang,et al.  JOTS: Joint Online Tracking and Segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Kristen Grauman,et al.  Supervoxel-Consistent Foreground Propagation in Video , 2014, ECCV.

[3]  Xiaojuan Qi,et al.  Referring Image Segmentation via Recurrent Refinement Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Luc Van Gool,et al.  Deep Extreme Cut: From Extreme Points to Object Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Michal Irani,et al.  Video Segmentation by Non-Local Consensus voting , 2014, BMVC.

[7]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[8]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[9]  B. Leibe,et al.  Adaptation of Convolutional Neural Networks for the 2017 DAVIS Challenge on Video Object Segmentation , 2017 .

[10]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jun-Sik Kim,et al.  Pixel-Level Matching for Video Object Segmentation Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Yuting Zhang,et al.  Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Trevor Darrell,et al.  Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  Svetlana Lazebnik,et al.  Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues , 2016, ArXiv.

[15]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[16]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[17]  Dani Lischinski,et al.  JumpCut , 2015, ACM Trans. Graph..

[18]  Licheng Yu,et al.  A Joint Speaker-Listener-Reinforcer Model for Referring Expressions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Alon Faktor,et al.  Video Object Segmentation by Non-Local Consensus voting , 2014, BMVC 2014.

[20]  Fatih Murat Porikli,et al.  Saliency-aware geodesic video object segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Luc Van Gool,et al.  The 2018 DAVIS Challenge on Video Object Segmentation , 2018, ArXiv.

[22]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[23]  Rada Mihalcea,et al.  Structured Matching for Phrase Localization , 2016, ECCV.

[24]  Alexander G. Schwing,et al.  MaskRNN: Instance Level Video Object Segmentation , 2018, NIPS.

[25]  Jinjun Xiong,et al.  Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts , 2018, NIPS.

[26]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Luc Van Gool,et al.  The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[28]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Michael J. Black,et al.  Video Segmentation via Object Flow , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Chenxi Liu,et al.  Recurrent Multimodal Interaction for Referring Image Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Cees Snoek,et al.  Actor and Action Video Segmentation from a Sentence , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Ramakant Nevatia,et al.  Query-Guided Regression Network with Context Policy for Phrase Grounding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[35]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[37]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[38]  Wenguan Wang,et al.  Super-Trajectory for Video Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[40]  Thomas Brox,et al.  Video Segmentation with Just a Few Strokes , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  Chang-Su Kim,et al.  Online Video Object Segmentation via Convolutional Trident Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[43]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Alexandre X. Falcão,et al.  FOMTrace: Interactive Video Segmentation By Image Graphs and Fuzzy Object Models , 2016, ArXiv.

[45]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Arnold W. M. Smeulders,et al.  Tracking by Natural Language Specification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Peter V. Gehler,et al.  Video Propagation Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Luc Van Gool,et al.  Object Referring in Videos with Language and Human Gaze , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[51]  Michael Gygli,et al.  Interactive Video Object Segmentation in the Wild , 2017, ArXiv.

[52]  James M. Rehg,et al.  Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[53]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Yong Jae Lee,et al.  Track and Segment: An Iterative Unsupervised Approach for Video Object Proposals , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[56]  Alexander Sorkine-Hornung,et al.  Bilateral Space Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Shi-Min Hu,et al.  Global contrast based salient region detection , 2011, CVPR 2011.

[58]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Chang-Su Kim,et al.  Primary Object Segmentation in Videos Based on Region Augmentation and Reduction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jan Kautz,et al.  Learning to Segment Instances in Videos with Spatial Propagation Network , 2017, ArXiv.

[62]  Jian Sun,et al.  ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  C. Lawrence Zitnick,et al.  Fast Edge Detection Using Structured Forests , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Thomas Brox,et al.  Lucid Data Dreaming for Multiple Object Tracking , 2017, 1703.09554.

[65]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[66]  Larry S. Davis,et al.  Modeling Context Between Objects for Referring Expression Understanding , 2016, ECCV.

[67]  Kristen Grauman,et al.  Click Carving: Segmenting Objects in Video with Point Clicks , 2016, HCOMP.

[68]  Markus H. Gross,et al.  Fully Connected Object Proposals for Video Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[69]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[70]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[71]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Trevor Darrell,et al.  Segmentation from Natural Language Expressions , 2016, ECCV.

[73]  Trevor Darrell,et al.  Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[75]  Gregory Shakhnarovich,et al.  Comprehension-Guided Referring Expressions , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Garrison W. Cottrell,et al.  Understanding Convolution for Semantic Segmentation , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[77]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[78]  K.-K. Maninis,et al.  Video Object Segmentation without Temporal Information , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[79]  Kai Chen,et al.  Video Object Segmentation with Re-identification , 2017, ArXiv.