Fine-grained Action Segmentation using the Semi-Supervised Action GAN

Abstract: In this paper we address the problem of continuous fine-grained action segmentation, in which multiple actions are present in an unsegmented video stream. The challenge of this task lies in representing the hierarchical nature of the actions and detecting the transitions between them, so that each action can be localised within the video. We propose a novel recurrent semi-supervised Generative Adversarial Network (GAN) model for continuous fine-grained human action segmentation. Temporal context information is captured via a novel Gated Context Extractor (GCE) module, composed of gated attention units, which directs the queued context information through the generator model for enhanced action segmentation. The GAN learns features in a semi-supervised manner, enabling the model to perform action classification jointly with the standard, unsupervised GAN learning procedure. We perform extensive evaluations of different architectural variants to demonstrate the importance of the proposed network architecture, and show that it outperforms the current state of the art on three challenging datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities.
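The abstract does not give the training objective, so the following is only a minimal sketch of one common way to combine action classification with the usual real-versus-generated GAN objective in a discriminator. The function name `semi_supervised_d_loss`, the two-headed output layout (class logits plus a separate real/fake logit), and the equal weighting of the two terms are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def semi_supervised_d_loss(class_logits, real_fake_logits, labels, is_real):
    """Hypothetical discriminator loss: supervised + unsupervised terms.

    class_logits:     (N, K) action-class scores for each frame
    real_fake_logits: (N,)   score that the sample is real (not generated)
    labels:           (N,)   integer action labels (ignored where is_real is False)
    is_real:          (N,)   boolean mask, True for real labelled samples
    """
    # Supervised term: cross-entropy over action classes, real samples only.
    logp = log_softmax(class_logits)
    picked = logp[np.arange(len(labels)), labels]
    sup = -picked[is_real].mean() if is_real.any() else 0.0

    # Unsupervised term: binary cross-entropy on real vs. generated,
    # in the stable form max(x, 0) - x * t + log(1 + exp(-|x|)).
    x = real_fake_logits
    t = is_real.astype(float)
    unsup = (np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x)))).mean()

    # Assumption: both terms are weighted equally.
    return sup + unsup, sup, unsup

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 8, 5
    total, sup, unsup = semi_supervised_d_loss(
        rng.normal(size=(n, k)),
        rng.normal(size=n),
        rng.integers(0, k, size=n),
        np.array([True] * 5 + [False] * 3),
    )
    print(f"total={total:.3f} supervised={sup:.3f} unsupervised={unsup:.3f}")
```

In this sketch the supervised term only sees real, labelled frames, while the unsupervised term sees both real and generated samples; that is one way the two learning signals could be trained jointly.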
