CNN-Based Multistage Gated Average Fusion (MGAF) for Human Action Recognition Using Depth and Inertial Sensors

A Convolutional Neural Network (CNN) offers the ability to extract and fuse features from all layers of its architecture. However, extracting and fusing intermediate features from different layers of a CNN remains uninvestigated for Human Action Recognition (HAR) using depth and inertial sensors. To take full advantage of access to all CNN layers, we propose a novel Multistage Gated Average Fusion (MGAF) network, which extracts and fuses features from every layer of the CNN using our novel and computationally efficient Gated Average Fusion (GAF) network, the core building block of MGAF. At the input of the proposed MGAF, we transform the depth data into depth images called Sequential Front view Images (SFI) and the inertial sensor data into Signal Images (SI). The SFI are formed from the front-view information generated by the depth data. A CNN is employed to extract feature maps from both input modalities. The GAF network fuses the extracted features effectively while preserving the dimensionality of the fused features. The proposed MGAF network is structurally extensible and can be unfolded to more than two modalities. Experiments on three publicly available multimodal HAR datasets demonstrate that the proposed MGAF outperforms previous state-of-the-art fusion methods for depth-inertial HAR in terms of recognition accuracy while being computationally much more efficient. On average, we increase accuracy by 1.5% while reducing the computational cost by approximately 50% relative to the previous state of the art.
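
Since the abstract does not spell out the internals of the GAF block, the following PyTorch sketch illustrates one plausible reading of gated average fusion: a learned sigmoid gate produces a weighted average of the depth-image and signal-image feature maps from the same CNN stage, so the fused output keeps the same dimensionality as each input. The module name, the 1x1-convolution gate, and the tensor shapes are illustrative assumptions, not the authors' exact design.

```python
# Hypothetical sketch of a Gated Average Fusion (GAF) block (assumed design):
# a sigmoid gate, learned from the concatenated modality features, blends the
# two feature maps so the fused output has the same shape as each input.
import torch
import torch.nn as nn


class GatedAverageFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution produces a per-location, per-channel gate in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_depth: torch.Tensor, feat_inertial: torch.Tensor) -> torch.Tensor:
        # feat_depth, feat_inertial: (N, C, H, W) feature maps from the same CNN stage.
        g = self.gate(torch.cat([feat_depth, feat_inertial], dim=1))
        # Gated average: output keeps the (N, C, H, W) shape of each input.
        return g * feat_depth + (1.0 - g) * feat_inertial


if __name__ == "__main__":
    # Example: fuse stage-wise features from a depth-image (SFI) CNN and a
    # signal-image (SI) CNN; the fused map can feed the next MGAF stage.
    gaf = GatedAverageFusion(channels=64)
    fused = gaf(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
    print(fused.shape)  # torch.Size([2, 64, 28, 28])
```

Because the gate output matches the input dimensionality, such a block could in principle be applied independently at every CNN stage, which is consistent with the abstract's claim that fusion preserves the dimensionality of the fused features and that the design extends to more than two modalities.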
