UvA-DARE (Digital Academic Repository) Spot On: Action Localization from Pointly-Supervised Proposals Pointly-Supervised Proposals

. We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated box annotations. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes , available at http://tinyurl.com/hollywood2tubes .

[1]  Nanning Zheng,et al.  Video Object Discovery and Co-Segmentation with Extremely Weak Supervision , 2017, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Wei Chen,et al.  Action Detection by Implicit Intentional Motion Clustering , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Haroon Idrees,et al.  Action Localization in Videos through Context Walk , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Nicu Sebe,et al.  Unsupervised Tube Extraction Using Transductive Learning and Dense Trajectories , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Cees G. M. Snoek,et al.  Objects2action: Classifying and Localizing Actions without Any Video Example , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Cees G. M. Snoek,et al.  APT: Action localization proposals from dense trajectories , 2015, BMVC.

[7]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Ran Xu,et al.  Human action segmentation with hierarchical supervoxel consistency , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Jia Xu,et al.  Learning to segment under various forms of weak supervision , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Tinne Tuytelaars,et al.  Weakly supervised object detection with convex clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ivan Laptev,et al.  Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[13]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Martial Hebert,et al.  Watch and learn: Semi-supervised learning of object detectors from videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Jean Ponce,et al.  Unsupervised Object Discovery and Tracking in Video Collections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Paolo Napoletano,et al.  An interactive tool for manual, semi-automatic and automatic video annotation , 2015, Comput. Vis. Image Underst..

[17]  Cordelia Schmid,et al.  Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  F. D. L. Torre,et al.  Multi-label Discriminative Weakly-Supervised Human Activity Recognition and Localization , 2014, ACCV.

[20]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[21]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[22]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Cordelia Schmid,et al.  Multi-fold MIL Training for Weakly Supervised Object Localization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[25]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[29]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[30]  Fei-Fei Li,et al.  Object-Centric Spatial Pooling for Image Classification , 2012, ECCV.

[31]  Tao Xiang,et al.  In Defence of Negative Mining for Annotating Weakly Labelled Data , 2012, ECCV.

[32]  Donald J. Patterson,et al.  International Journal of Computer Vision manuscript No. (will be inserted by the editor) Efficiently Scaling Up Crowdsourced Video Annotation A Set of Best Practices for High Quality, Economical Video Labeling , 2012 .

[33]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Deva Ramanan,et al.  Video Annotation and Tracking with Active Learning , 2011, NIPS.

[35]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[36]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[37]  François Fleuret,et al.  FlowBoost — Appearance learning from sparsely annotated video , 2011, CVPR 2011.

[38]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Antonio Torralba,et al.  Unsupervised Detection of Regions of Interest Using Iterative Link Analysis , 2009, NIPS.

[40]  Antonio Torralba,et al.  LabelMe video: Building a video database with human annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[41]  Carsten Rother,et al.  Weakly supervised discriminative localization and classification: a joint learning process , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[42]  L. Itti,et al.  Quantifying center bias of observers in free viewing of dynamic natural scenes. , 2009, Journal of vision.

[43]  C. Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Amir Roshan Zamir,et al.  Action Recognition in Realistic Sports Videos , 2014 .

[46]  Tao Xiang,et al.  Weakly Supervised Action Detection , 2011, BMVC.

[47]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[48]  David Mihalcik,et al.  The Design and Implementation of ViPER , 2005 .

[49]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.