A deep multimodal network based on bottleneck layer features fusion for action recognition

Human Activity Recognition (HAR) in videos using convolutional neural networks has become the preferred choice for researchers due to the tremendous success of deep learning models in visual recognition applications. After the introduction of low-cost depth sensors, activity recognition systems based on multiple modalities were successfully developed over the past decade. Nevertheless, recognizing complex human activities in videos remains challenging. In this work, we propose a deep bottleneck multimodal feature fusion (D-BMFF) framework that fuses three different modalities (RGB, depth (RGB-D), and 3D skeleton coordinates) for activity classification, making full use of the information simultaneously available from a depth sensor. During training, RGB and depth frames are sampled at regular intervals from each activity video, while the 3D coordinates are first converted into a single RGB skeleton motion history image (RGB-SklMHI). Features are extracted from the multimodal inputs using recent deep pre-trained network architectures. The multimodal features obtained from the bottleneck layers, just before the top layer, are fused using multiset discriminant correlation analysis (M-DCA), which allows for robust visual action modeling. Finally, the fused features are classified with a linear multiclass support vector machine (SVM). The proposed approach is evaluated on four standard RGB-D datasets: UT-Kinect, CAD-60, Florence 3D, and SBU Interaction. Our framework produces outstanding results and outperforms state-of-the-art methods.
