An efficient end-to-end deep learning architecture for activity classification

Deep learning is widely considered to be the most important method in computer vision fields, which has a lot of applications such as image recognition, robot navigation systems and self-driving cars. Recent developments in neural networks have led to an efficient end-to-end architecture to human activity representation and classification. In the light of these recent events in deep learning, there is now much considerable concern about developing less expensive computation and memory-wise methods. This paper presents an optimized end-to-end approach to describe and classify human action videos. In the beginning, RGB activity videos are sampled to frame sequences. Then convolutional features are extracted from these frames based on the pre-trained Inception-v3 model. Finally, video actions classification is done by training a long short-term with feature vectors. Our proposed architecture aims to perform low computational cost and improved accuracy performances. Our efficient end-to-end approach outperforms previously published results by an accuracy rate of 98.4% and 98.5% on the UTD-MHAD HS and UTD-MHAD SS public dataset experiments, respectively.

[1]  Jeremy S. Smith,et al.  Hierarchical Multi-scale Attention Networks for action recognition , 2017, Signal Process. Image Commun..

[2]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Ling Shao,et al.  Action Recognition Using 3D Histograms of Texture and A Multi-Class Boosting Classifier , 2017, IEEE Transactions on Image Processing.

[4]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[5]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Sergio Escalera,et al.  A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[7]  Xiaodong Yang,et al.  Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[8]  Li Xie,et al.  Fully convolutional networks for action recognition , 2017, IET Comput. Vis..

[9]  Yi Zhu,et al.  Deep Local Video Feature for Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[10]  Nasrollah Moghaddam Charkari,et al.  Survey on deep learning methods in human action recognition , 2017, IET Comput. Vis..

[11]  Nasser Kehtarnavaz,et al.  A Real-Time Human Action Recognition System Using Depth and Inertial Sensor Fusion , 2016, IEEE Sensors Journal.

[12]  Quanshi Zhang,et al.  Visual interpretability for deep learning: a survey , 2018, Frontiers of Information Technology & Electronic Engineering.

[13]  Edwin Escobedo,et al.  A New Approach for Dynamic Gesture Recognition Using Skeleton Trajectory Representation and Histograms of Cumulative Magnitudes , 2016, 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI).

[14]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[16]  Yifeng He,et al.  Human Action Recognition Using Hybrid Centroid Canonical Correlation Analysis , 2015, 2015 IEEE International Symposium on Multimedia (ISM).

[17]  Sanghoon Lee,et al.  Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Baoxin Li,et al.  Multi-stream CNN: Learning representations based on human-related regions for action recognition , 2018, Pattern Recognit..

[19]  Cordelia Schmid,et al.  Multi-region Two-Stream R-CNN for Action Detection , 2016, ECCV.

[20]  Laurence T. Yang,et al.  A survey on deep learning for big data , 2018, Inf. Fusion.

[21]  Wei Zeng,et al.  Learning Long-Term Dependencies for Action Recognition with a Biologically-Inspired Deep Network , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Alok Kumar Singh Kushwaha,et al.  A recent survey for human activity recoginition based on deep learning approach , 2017, 2017 Fourth International Conference on Image Information Processing (ICIIP).

[23]  Jinwen Ma,et al.  DMMs-Based Multiple Features Fusion for Human Action Recognition , 2015, Int. J. Multim. Data Eng. Manag..

[24]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Mehrtash Tafazzoli Harandi,et al.  Going deeper into action recognition: A survey , 2016, Image Vis. Comput..

[27]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[28]  Mohamed Atri,et al.  Naive Bayesian Fusion for Action Recognition from Kinect , 2017 .

[29]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[31]  Javed Imran,et al.  Human action recognition using RGB-D sensor and deep convolutional neural networks , 2016, 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

[32]  Yu Qiao,et al.  Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos , 2018, IEEE Transactions on Image Processing.

[33]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[36]  Xiaofeng Wang,et al.  Action recognition using vague division DMMs , 2017 .

[37]  Xiaofeng Wang,et al.  Human action recognition using transfer learning with deep representations , 2017, 2017 International Joint Conference on Neural Networks (IJCNN).

[38]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Parma Nand,et al.  Video Dynamics Detection Using Deep Neural Networks , 2018, IEEE Transactions on Emerging Topics in Computational Intelligence.