Heterogeneous Non-Local Fusion for Multimodal Activity Recognition

In this work, we investigate activity recognition using multimodal inputs from heterogeneous sensors. Activity recognition is commonly tackled from a single-modal perspective using videos. When multiple signals are used, they typically come from the same homogeneous modality, e.g., color and optical flow. Here, we propose an activity network that fuses multimodal inputs coming from completely different and heterogeneous sensors. We frame this heterogeneous fusion as a non-local operation: the key observation is that a non-local operation only requires the channel dimensions of its inputs to match, while their spatial and temporal shapes may differ. The network therefore fuses heterogeneous inputs while preserving the shape and dimensionality natural to each input. We outline both asymmetric fusion, where one modality reinforces the other, and symmetric fusion variants. To further promote research into multimodal activity recognition, we introduce GloVid, a first-person activity dataset captured with video recordings and smart glove sensor readings. Experiments on GloVid show the potential of heterogeneous non-local fusion for activity recognition, outperforming individual modalities and standard fusion techniques.
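To make the core observation concrete, below is a minimal PyTorch sketch of an asymmetric heterogeneous non-local fusion block, in which 3D video features attend over 1D sensor features. This is an illustration of the general technique, not the paper's exact architecture: the class name `HeteroNonLocalFusion`, the embedding width `c_embed`, and the tensor shapes are assumptions; only the requirement that the embedded channel dimensions match is taken from the text.

```python
# Minimal sketch (assumed, not the paper's exact architecture) of an
# asymmetric non-local fusion block between two heterogeneous feature maps.
# The only constraint is a shared embedding channel dimension; the
# spatial/temporal extents of the two inputs may differ freely.
import torch
import torch.nn as nn


class HeteroNonLocalFusion(nn.Module):
    """Asymmetric non-local fusion: modality B modulates modality A."""

    def __init__(self, c_a: int, c_b: int, c_embed: int = 64):
        super().__init__()
        self.query = nn.Conv3d(c_a, c_embed, kernel_size=1)  # from modality A (e.g. video)
        self.key = nn.Conv1d(c_b, c_embed, kernel_size=1)    # from modality B (e.g. glove)
        self.value = nn.Conv1d(c_b, c_embed, kernel_size=1)
        self.out = nn.Conv3d(c_embed, c_a, kernel_size=1)    # project back to A's channels

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (B, c_a, T, H, W) video features
        # feat_b: (B, c_b, L) 1D sensor features; T*H*W and L need not match
        b, _, t, h, w = feat_a.shape
        q = self.query(feat_a).flatten(2).transpose(1, 2)        # (B, THW, c_embed)
        k = self.key(feat_b)                                     # (B, c_embed, L)
        v = self.value(feat_b).transpose(1, 2)                   # (B, L, c_embed)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, THW, L)
        fused = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return feat_a + self.out(fused)                          # residual keeps A's shape


if __name__ == "__main__":
    video = torch.randn(2, 256, 4, 14, 14)  # (batch, channels, T, H, W)
    glove = torch.randn(2, 32, 100)         # (batch, channels, timesteps)
    block = HeteroNonLocalFusion(c_a=256, c_b=32)
    print(block(video, glove).shape)        # torch.Size([2, 256, 4, 14, 14])
```

The residual connection keeps the output in the video stream's shape, matching the asymmetric setting in which the sensor modality reinforces the video modality; a symmetric variant could apply a mirrored block in the opposite direction as well.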
