论文信息 - Fast Weakly Supervised Action Segmentation Using Mutual Consistency

Fast Weakly Supervised Action Segmentation Using Mutual Consistency

Action segmentation is the task of predicting the actions for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being 14 times faster to train and 20 times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.

[1] Cordelia Schmid,et al. Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.

[2] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[3] Andrew Zisserman,et al. Spatial Transformer Networks , 2015, NIPS.

[4] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Hilde Kuehne,et al. A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[7] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[8] Shmuel Peleg,et al. Temporal Segmentation of Egocentric Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9] Radomír Mech,et al. Sequence-to-Segment Networks for Segment Detection , 2018, NeurIPS.

[10] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[12] Ronald J. Williams,et al. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[13] Martial Hebert,et al. Temporal segmentation and activity classification from first-person sensing , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[14] On Evaluating Weakly Supervised Action Segmentation Methods , 2020, ArXiv.

[15] Kaiming He,et al. Group Normalization , 2018, ECCV.

[16] Luc Van Gool,et al. Temporal 3D ConvNets Using Temporal Transition Layer , 2018, CVPR Workshops.

[17] Thomas Serre,et al. An end-to-end generative framework for video segmentation and recognition , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[18] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[20] Yazan Abu Farha,et al. MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Cordelia Schmid,et al. Temporal Localization of Actions with Actoms. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[22] Thomas Serre,et al. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23] M. Friedman. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance , 1937 .

[24] Yann Dauphin,et al. Convolutional Sequence to Sequence Learning , 2017, ICML.

[25] Luc Van Gool,et al. Spatio-Temporal Channel Correlation Networks for Action Classification , 2018, ECCV.

[26] Gregory D. Hager,et al. Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation , 2016, ECCV.

[27] Bernt Schiele,et al. A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28] Gregory D. Hager,et al. Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Giovanni Maria Farinella,et al. Personal-location-based temporal segmentation of egocentric videos for lifelogging applications , 2018, J. Vis. Commun. Image Represent..

[30] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31] Luc Van Gool,et al. Large Scale Holistic Video Understanding , 2019, ECCV.

[32] Juan Carlos Niebles,et al. D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Juan Carlos Niebles,et al. Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.

[34] Jun Li,et al. Weakly Supervised Energy-Based Learning for Action Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35] Chenliang Xu,et al. Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[37] Ming Shao,et al. A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Juergen Gall,et al. Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[39] Yoshua Bengio,et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[40] Juergen Gall,et al. NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Juergen Gall,et al. Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Juergen Gall,et al. Temporal Action Detection Using a Statistical Language Model , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).