Spatio-Temporal Instance Learning: Action Tubes from Class Supervision

The goal of this work is spatio-temporal action localization in videos, using only video-level class labels as supervision. The state of the art casts this weakly-supervised action localization regime as a Multiple Instance Learning problem, where the instances are spatio-temporal proposals computed a priori. Rather than disconnecting the spatio-temporal learning from the training, we propose Spatio-Temporal Instance Learning, which enables action localization directly from box proposals in video frames. We outline the assumptions of our model and propose a max-margin objective, optimized with latent variables, that enables spatio-temporal learning of actions from video labels. We also provide an efficient linking algorithm and two reranking strategies that facilitate and further improve the action localization. Experimental evaluation on four action datasets demonstrates the effectiveness of our approach for localization from weak supervision. Moreover, we show how to incorporate other supervision levels and mixtures, as a step towards determining optimal supervision strategies for action localization.
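
For readers unfamiliar with the Multiple Instance Learning setting mentioned above, the following is a minimal, generic sketch of a max-margin objective with a latent instance choice per bag. It is not the Spatio-Temporal Instance Learning method of this work: the function names, the mean-difference initialization, the subgradient updates, and the toy data are all assumptions made for illustration. A "bag" stands in for a video and an "instance" for a proposal feature vector.

```python
import numpy as np

def init_weights(bags, labels):
    # Assumed heuristic: start from mean(positive instances) - mean(negative instances).
    pos = np.vstack([X for X, y in zip(bags, labels) if y > 0])
    neg = np.vstack([X for X, y in zip(bags, labels) if y < 0])
    return pos.mean(axis=0) - neg.mean(axis=0)

def train_mil_max_margin(bags, labels, epochs=50, lr=0.1, reg=0.01):
    """Subgradient descent on a bag-level hinge loss.

    bags:   list of (n_i, d) arrays, one row per instance
    labels: one label in {+1, -1} per bag
    """
    w = init_weights(bags, labels)
    b = 0.0
    for _ in range(epochs):
        for X, y in zip(bags, labels):
            scores = X @ w + b
            # Latent variable: the top-scoring instance represents the bag.
            k = int(np.argmax(scores))
            # L2 regularization applied as a multiplicative shrink.
            w = w * (1.0 - lr * reg)
            # Hinge loss on the bag score; update only on a margin violation.
            if y * scores[k] < 1.0:
                w = w + lr * y * X[k]
                b += lr * y
    return w, b

def select_instances(bags, w, b):
    # Localization step: index of the highest-scoring instance in each bag.
    return [int(np.argmax(X @ w + b)) for X in bags]

# Toy data (hypothetical): each positive bag hides one strong instance,
# at index 1 in the first bag and index 2 in the second.
bags = [
    np.array([[0.1, 0.0], [2.0, 2.0], [0.0, 0.1]]),
    np.array([[0.0, -0.1], [-0.1, 0.0], [2.0, 1.8]]),
    np.array([[0.1, 0.1], [-0.1, 0.0]]),
    np.array([[0.0, -0.1], [0.1, -0.1]]),
]
labels = [1, 1, -1, -1]

w, b = train_mil_max_margin(bags, labels)
print(select_instances(bags[:2], w, b))
```

Taking the maximum over instances encodes the usual Multiple Instance Learning assumption: a positive bag needs at least one positive instance, while every instance in a negative bag must score below the margin.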
