Multi-view Action Recognition Using Cross-View Video Prediction

In this work, we address the problem of action recognition in a multi-view environment. Most of the existing approaches utilize pose information for multi-view action recognition. We focus on RGB modality instead and propose an unsupervised representation learning framework, which encodes the scene dynamics in videos captured from multiple viewpoints via predicting actions from unseen views. The framework takes multiple short video clips from different viewpoints and time as input and learns an holistic internal representation which is used to predict a video clip from an unseen viewpoint and time. The ability of the proposed network to render unseen video frames enables it to learn a meaningful and robust representation of the scene dynamics. We evaluate the effectiveness of the learned representation for multiview video action recognition in a supervised approach. We observe a significant improvement in the performance with RGB modality on NTU-RGB+D dataset, which is the largest dataset for multi-view action recognition. The proposed framework also achieves state-of-the-art results with depth modality, which validates the generalization capability of the approach to other data modalities. The code is publicly available at https://github.com/svyas23/cross-view-action.

[1]  Gang Wang,et al.  Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[3]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[4]  Jan Kautz,et al.  MoCoGAN: Decomposing Motion and Content for Video Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Li Fei-Fei,et al.  Unsupervised Learning of Long-Term Motion Dynamics for Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[9]  Arif Mahmood,et al.  Histogram of Oriented Principal Components for Cross-View Action Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jiaya Jia,et al.  View Independent Generative Adversarial Network for Novel View Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Nanning Zheng,et al.  View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Kristen Grauman,et al.  ShapeCodes: Self-supervised Feature Learning by Lifting Views to Viewgrids , 2017, ECCV.

[14]  Nanning Zheng,et al.  Adding Attentiveness to the Neurons in Recurrent Neural Networks , 2018, ECCV.

[15]  Fei Wu,et al.  Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition , 2019, AAAI.

[16]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Mohan M. Trivedi,et al.  Joint Angles Similarities and HOG2 for Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Mohan S. Kankanhalli,et al.  Unsupervised Learning of View-invariant Action Representations , 2018, NeurIPS.

[19]  Ruonan Li,et al.  Discriminative virtual views for cross-view action recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Jian Yang,et al.  Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition , 2018, AAAI.

[21]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Tomaso Poggio,et al.  A fast, invariant representation for human action in the visual system. , 2018, Journal of neurophysiology.

[24]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Jürgen Schmidhuber,et al.  LSTM can Solve Hard Long Time Lag Problems , 1996, NIPS.

[26]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[27]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[28]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[29]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Oswald Lanz,et al.  View-LSTM: Novel-View Video Synthesis Through View Decomposition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  David Picard,et al.  2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Ming-Hsuan Yang,et al.  Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[34]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Christian Wolf,et al.  Human Action Recognition: Pose-Based Attention Draws Focus to Hands , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[36]  Nanning Zheng,et al.  Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[38]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Dong Xu,et al.  Dividing and Aggregating Network for Multi-view Action Recognition , 2018, ECCV.

[40]  Hassen Drira,et al.  Coding Kendall's Shape Trajectories for 3D Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  James J. Little,et al.  3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[43]  Ali Borji,et al.  Cross-View Image Synthesis Using Conditional GANs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Eric P. Xing,et al.  Nonparametric Variational Auto-Encoders for Hierarchical Representation Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Binlong Li,et al.  Cross-view activity recognition using Hankelets , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Lin Gao,et al.  Graph CNNs with Motif and Variable Temporal Block for Skeleton-Based Action Recognition , 2019, AAAI.

[50]  Petros Koumoutsakos,et al.  ContextVP: Fully Context-Aware Video Prediction , 2017, ECCV.

[51]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[52]  Shunta Saito,et al.  Temporal Generative Adversarial Nets with Singular Value Clipping , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[54]  Jeff Donahue,et al.  Efficient Video Generation on Complex Datasets , 2019, ArXiv.