Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition

Fusion of multiple modalities from different sensors is an important area of research for multimodal human action recognition. In this paper, we conduct an in-depth study of the effect of factors such as input preprocessing, data augmentation, network architecture and model fusion, with the aim of providing a practical guideline for multimodal action recognition using deep learning. First, for RGB videos, we propose a novel image-based descriptor called the stacked dense flow difference image (SDFDI), which captures the spatio-temporal information present in a video sequence. A variety of deep 2D convolutional neural networks (CNNs) are then trained to compare our SDFDI against state-of-the-art image-based representations. Second, for the skeleton stream, we propose a data augmentation technique based on 3D transformations to facilitate training a deep neural network on small datasets. We also propose a bidirectional gated recurrent unit (BiGRU) based recurrent neural network (RNN) to model skeleton data. Third, for inertial sensor data, we propose data augmentation based on jittering with white Gaussian noise, along with a deep 1D-CNN for action classification. The outputs of these three heterogeneous networks (1D-CNN, 2D-CNN and BiGRU) are combined using a variety of model fusion approaches based on score and feature fusion. Finally, to illustrate the efficacy of the proposed framework, we test our model on the publicly available UTD-MHAD dataset and achieve an overall accuracy of 97.91%, which is about 4% higher than using each modality individually. We hope that the discussions and conclusions from this work will give researchers in related fields deeper insight and open avenues for further studies of multi-sensor fusion architectures.
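The two augmentation ideas and the score-level fusion mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: the abstract does not specify the noise level, the set of 3D transformations, or the fusion rules, so the function names (`jitter`, `rotate_skeleton_y`, `fuse_scores`, `predict`), the default `sigma`, the choice of a rotation about the vertical axis, and the sum and product rules are all assumptions chosen for illustration.

```python
import math
import random


def jitter(signal, sigma=0.05, rng=None):
    """Add zero-mean white Gaussian noise to each sample of one inertial channel.

    sigma is a hypothetical noise level, not a value taken from the paper.
    """
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in signal]


def rotate_skeleton_y(joints, angle_deg):
    """Rotate every (x, y, z) joint about the vertical (y) axis.

    A rotation is one example of a 3D transformation; the paper may use others.
    """
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


def fuse_scores(score_lists, rule="sum"):
    """Combine per-class scores from several models and renormalise to sum to 1."""
    n_classes = len(score_lists[0])
    if rule == "sum":
        fused = [sum(s[k] for s in score_lists) for k in range(n_classes)]
    elif rule == "product":
        fused = [math.prod(s[k] for s in score_lists) for k in range(n_classes)]
    else:
        raise ValueError("rule must be 'sum' or 'product'")
    total = sum(fused)
    return [f / total for f in fused]


def predict(fused):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Made-up softmax outputs for three classes from the three streams.
    inertial_scores = [0.6, 0.3, 0.1]   # e.g. from a 1D-CNN
    rgb_scores = [0.2, 0.5, 0.3]        # e.g. from a 2D-CNN
    skeleton_scores = [0.1, 0.7, 0.2]   # e.g. from a BiGRU
    fused = fuse_scores([inertial_scores, rgb_scores, skeleton_scores], "sum")
    print("fused scores:", fused, "predicted class:", predict(fused))
```

In this toy example the inertial stream alone would pick class 0, while the fused scores favour class 1, which is the kind of disagreement that score fusion is meant to resolve. Feature fusion, the other family named above, would instead combine intermediate network activations before classification and is not shown here.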
