View-Invariant Deep Architecture for Human Action Recognition Using Two-Stream Motion and Shape Temporal Dynamics

Human action Recognition for unknown views, is a challenging task. We propose a deep view-invariant human action recognition framework, which is a novel integration of two important action cues: motion and shape temporal dynamics (STD). The motion stream encapsulates the motion content of action as RGB Dynamic Images (RGB-DIs), which are generated by Approximate Rank Pooling (ARP) and processed by using fine-tuned InceptionV3 model. The STD stream learns long-term view-invariant shape dynamics of action using a sequence of LSTM and Bi-LSTM learning models. Human Pose Model (HPM) generates view-invariant features of structural similarity index matrix (SSIM) based key depth human pose frames. The final prediction of the action is made on the basis of three types of late fusion techniques i.e. maximum (max), average (avg) and multiply (mul), applied on individual stream scores. To validate the performance of the proposed novel framework, the experiments are performed using both cross-subject and cross-view validation schemes on three publically available benchmarks-NUCLA multi-view dataset, UWA3D-II Activity dataset and NTU RGB-D Activity dataset. Our algorithm outperforms existing state-of-the-arts significantly, which is measured in terms of recognition accuracy, receiver operating characteristic (ROC) curve and area under the curve (AUC).

[1]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[3]  Takio Kurita,et al.  Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning , 2017, EURASIP J. Image Video Process..

[4]  Jiwen Lu,et al.  Multi-modal uniform deep learning for RGB-D person re-identification , 2017, Pattern Recognit..

[5]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Kuldeep Singh,et al.  Human Activity Recognition Based on Spatial Distribution of Gradients at Sublevels of Average Energy Silhouette Images , 2017, IEEE Transactions on Cognitive and Developmental Systems.

[7]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[8]  Chunheng Wang,et al.  Cross-View Action Recognition via a Continuous Virtual Path , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Arif Mahmood,et al.  Histogram of Oriented Principal Components for Cross-View Action Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  James J. Little,et al.  3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Ling Shao,et al.  Enhanced Computer Vision With Microsoft Kinect Sensor: A Review , 2013, IEEE Transactions on Cybernetics.

[13]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ajmal Mian,et al.  3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Chalavadi Krishna Mohan,et al.  Human action recognition in RGB-D videos using motion sequence information and deep learning , 2017, Pattern Recognit..

[16]  Hengcan Shi,et al.  Gaze-Assisted Multi-Stream Deep Neural Network for Action Recognition , 2017, IEEE Access.

[17]  Fabio Cuzzolin,et al.  3D Activity Recognition Using Motion History and Binary Shape Templates , 2014, ACCV Workshops.

[18]  Jun Li,et al.  Deeply Learned View-Invariant Features for Cross-View Action Recognition , 2017, IEEE Transactions on Image Processing.

[19]  Baoxin Li,et al.  Multi-stream CNN: Learning representations based on human-related regions for action recognition , 2018, Pattern Recognit..

[20]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[23]  Larry H. Matthies,et al.  Pooled motion features for first-person videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Ali Seydi Keçeli Viewpoint projection based deep feature learning for single and dyadic action recognition , 2018, Expert Syst. Appl..

[25]  Ajmal Mian,et al.  Learning Human Pose Models from Synthesized Data for Robust RGB-D Action Recognition , 2017, International Journal of Computer Vision.

[26]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Mubarak Shah,et al.  Learning a Deep Model for Human Action Recognition from Novel Viewpoints , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Jian Liu,et al.  Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition , 2017, CVPR Workshops.

[31]  Yang Xiao,et al.  Action Recognition for Depth Video using Multi-view Dynamic Images , 2018, Inf. Sci..

[32]  Mubarak Shah,et al.  Discovering Motion Primitives for Unsupervised Grouping and One-Shot Learning of Human Actions, Gestures, and Expressions , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Tej Singh,et al.  Video benchmarks of human action datasets: a review , 2018, Artificial Intelligence Review.

[35]  Binlong Li,et al.  Cross-view activity recognition using Hankelets , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Hao Wang,et al.  Exploring hybrid spatio-temporal convolutional networks for human action recognition , 2017, Multimedia Tools and Applications.

[37]  Wenjun Zeng,et al.  Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection , 2018, IEEE Transactions on Image Processing.

[38]  Jingjing Zheng,et al.  Learning View-Invariant Sparse Representations for Cross-View Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[40]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[41]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Naveed Akhtar,et al.  Viewpoint Invariant RGB-D Human Action Recognition , 2017, 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[43]  Luc Van Gool,et al.  Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification , 2017, ArXiv.

[44]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[45]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[46]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Shuang Wang,et al.  Structured Images for RGB-D Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[48]  Weiyao Lin,et al.  Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion , 2018, AAAI.

[49]  Md. Atiqur Rahman Ahad,et al.  Motion history image: its variants and applications , 2012, Machine Vision and Applications.