Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection

We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors. The most successful methods on VOS synthesize a pool of pseudo-annotations (PAs) and then refine them iteratively. However, they face challenges as to how to select from a massive amount of PAs high-quality ones, how to set an appropriate stop condition for weakly-supervised training, and how to initialize PAs pertaining to VAAS. To overcome these challenges, we propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS^2). Instead of blindly trusting quality-inconsistent PAs, WS^2 employs a learning-based selection to select effective PAs and a novel region integrity criterion as a stopping condition for weakly-supervised training. In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task. Extensive experiments show that WS^2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks and is on par with the best fully-supervised method on VAAS.

[1]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Trevor Darrell,et al.  Learning to Segment Every Thing , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Yao Zhao,et al.  Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Mubarak Shah,et al.  Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Chenliang Xu,et al.  Actor-Action Semantic Segmentation with Grouping Process Models , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Karteek Alahari,et al.  Weakly-Supervised Semantic Segmentation Using Motion Cues , 2016, ECCV.

[8]  Juha Karhunen,et al.  Semi-supervised Domain Adaptation for Weakly Labeled Semantic Video Object Segmentation , 2016, ACCV.

[9]  Chenliang Xu,et al.  Weakly Supervised Actor-Action Segmentation via Robust Multi-task Ranking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[11]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Matthew A. Brown,et al.  Learning to Segment via Cut-and-Paste , 2018, ECCV.

[13]  Juan Carlos Niebles,et al.  End-to-End Joint Semantic Segmentation of Actors and Actions in Video , 2018, ECCV.

[14]  Shih-Fu Chang,et al.  Localizing Actions from Video Labels and Pseudo-Annotations , 2017, BMVC.

[15]  Seunghoon Hong,et al.  Weakly Supervised Semantic Segmentation Using Web-Crawled Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Kristen Grauman,et al.  Supervoxel-Consistent Foreground Propagation in Video , 2014, ECCV.

[17]  Liang Lin,et al.  An Approach to Streaming Video Segmentation With Sub-Optimal Low-Rank Decomposition , 2016, IEEE Transactions on Image Processing.

[18]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[20]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Yuri Boykov,et al.  Normalized Cut Loss for Weakly-Supervised CNN Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Philip H. S. Torr,et al.  Weakly- and Semi-Supervised Panoptic Segmentation , 2022 .

[23]  Ming-Hsuan Yang,et al.  Semantic Co-segmentation in Videos , 2016, ECCV.

[24]  Luc Van Gool,et al.  The 2018 DAVIS Challenge on Video Object Segmentation , 2018, ArXiv.

[25]  Cordelia Schmid,et al.  Joint Learning of Object and Action Detectors , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Qin Huang,et al.  Multiple Instance Curriculum Learning for Weakly Supervised Object Detection , 2017, BMVC.

[27]  Fei-Fei Li,et al.  Discriminative Segment Annotation in Weakly Labeled Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Qiang Qiu,et al.  Weakly Supervised Instance Segmentation Using Class Peak Response , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Ismail Ben Ayed,et al.  On Regularized Losses for Weakly-supervised CNN Segmentation , 2018, ECCV.

[30]  Chen Wang,et al.  Semantic object segmentation via detection in weakly labeled video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Qi Tian,et al.  Zigzag Learning for Weakly Supervised Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Chenliang Xu,et al.  Can humans fly? Action understanding with multiple classes of actors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Seunghoon Hong,et al.  Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network , 2017, AAAI.

[35]  Ian D. Reid,et al.  Bootstrapping the Performance of Webly Supervised Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jordi Pont-Tuset,et al.  Convolutional Oriented Boundaries: From Image Segmentation to High-Level Tasks , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Tao Mei,et al.  Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation , 2018, IEEE Transactions on Multimedia.

[39]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[40]  Trevor Darrell,et al.  Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Cees Snoek,et al.  Actor and Action Video Segmentation from a Sentence , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Wenyu Liu,et al.  Multiple Instance Detection Network with Online Instance Classifier Refinement , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Bernard Ghanem,et al.  W2F: A Weakly-Supervised to Fully-Supervised Framework for Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[45]  Pascal Fua,et al.  SLIC Superpixels Compared to State-of-the-Art Superpixel Methods , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[47]  Dong Xu,et al.  SPFTN: A Self-Paced Fine-Tuning Network for Segmenting Objects in Weakly Labelled Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Ali Borji,et al.  Revisiting Video Saliency: A Large-Scale Benchmark and a New Model , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Jian Sun,et al.  BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[50]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[51]  Yunchao Wei,et al.  Learning to Segment Human by Watching YouTube , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.