Uncovering Human Multimodal Activity Recognition with a Deep Learning Approach

Recent breakthroughs in deep learning and computer vision have encouraged the use of multimodal human activity recognition, aiming at applications in human-robot interaction. The wide availability of videos on online platforms has made this modality one of the most promising for the task, while some researchers have tried to enhance video data with wearable sensors attached to human subjects. However, the temporal information in both video and inertial sensor data is still under investigation. Most current work focusing on daily activities does not present comparative studies considering different temporal approaches. In this paper, we propose a new model built upon a Two-Stream ConvNet for action recognition, enhanced with a Long Short-Term Memory (LSTM) network and a Temporal Convolutional Network (TCN) to investigate the temporal information in videos and inertial sensors. A feature-level fusion approach prior to temporal modelling is also proposed and evaluated. Experiments were conducted on an egocentric multimodal dataset and on UTD-MHAD. LSTM and TCN showed competitive results, with the TCN performing slightly better in most settings. The feature-level fusion approach also performed well on UTD-MHAD, but showed some overfitting on the egocentric multimodal dataset. Overall, the proposed model presented promising results on both datasets, compatible with the state of the art, providing insights into the use of deep learning for human-robot interaction applications.
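
To make the described data flow concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the feature dimensions (2048-d appearance and motion features, 64-d inertial features), the hidden size, the dilation schedule, and the names `FusionTemporalModel` and `TemporalConvBlock` are all assumptions. It shows feature-level fusion of the two video streams with the inertial stream prior to temporal modelling, with the temporal stage switchable between an LSTM and a small TCN, mirroring the comparison described in the abstract.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """One dilated causal convolution block with a residual connection,
    in the spirit of TCNs (Bai et al.). Hypothetical sizes."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        out = self.conv(x)
        out = out[:, :, :x.size(2)]             # trim right padding: causal
        return self.relu(out) + x               # residual connection

class FusionTemporalModel(nn.Module):
    """Feature-level fusion of per-frame RGB, optical-flow, and inertial
    features, followed by either an LSTM or a small TCN over time."""
    def __init__(self, rgb_dim=2048, flow_dim=2048, imu_dim=64,
                 hidden=256, num_classes=27, temporal="tcn"):
        super().__init__()
        fused_dim = rgb_dim + flow_dim + imu_dim
        self.project = nn.Linear(fused_dim, hidden)
        self.temporal_kind = temporal
        if temporal == "lstm":
            self.temporal = nn.LSTM(hidden, hidden, batch_first=True)
        else:
            self.temporal = nn.Sequential(      # assumed dilation schedule
                TemporalConvBlock(hidden, dilation=1),
                TemporalConvBlock(hidden, dilation=2),
                TemporalConvBlock(hidden, dilation=4),
            )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, flow, imu):
        # rgb/flow: (batch, time, feat_dim); imu: (batch, time, imu_dim)
        x = self.project(torch.cat([rgb, flow, imu], dim=-1))  # fuse first
        if self.temporal_kind == "lstm":
            x, _ = self.temporal(x)             # (batch, time, hidden)
        else:
            x = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        return self.classifier(x.mean(dim=1))   # average-pool over time

# Toy usage: 8-step clips; 27 classes matches UTD-MHAD's action count.
model = FusionTemporalModel(temporal="tcn")
rgb = torch.randn(4, 8, 2048)
flow = torch.randn(4, 8, 2048)
imu = torch.randn(4, 8, 64)
logits = model(rgb, flow, imu)                  # shape: (4, 27)
```

Setting `temporal="lstm"` swaps in the recurrent variant, so the same fused representation can be fed to both temporal models for a controlled comparison, which is the kind of study the abstract describes.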

[1] Thomas Plötz, et al. Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables, 2016, IJCAI.

[2] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Jacob T. Browne. Wizard of Oz Prototyping for Machine Learning Experiences, 2019, CHI Extended Abstracts.

[4] Mehrtash Tafazzoli Harandi, et al. Going deeper into action recognition: A survey, 2016, Image and Vision Computing.

[5] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[6] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[7] Colin G. Johnson, et al. A Situation-Aware Fear Learning (SAFEL) model for robots, 2017, Neurocomputing.

[8] Nasser Kehtarnavaz, et al. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, 2015, IEEE International Conference on Image Processing (ICIP).

[9] Andrew Zisserman, et al. Deep Insights into Convolutional Networks for Video Recognition, 2019, International Journal of Computer Vision.

[10] Vladlen Koltun, et al. Convolutional Sequence Modeling Revisited, 2018, ICLR.

[11] Matthias Hein, et al. Variants of RMSProp and Adagrad with Logarithmic Regret Bounds, 2017, ICML.

[12] Gunnar Farnebäck. Two-Frame Motion Estimation Based on Polynomial Expansion, 2003, SCIA.

[13] Joo-Hwee Lim, et al. Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[14] Jie Lin, et al. Egocentric activity recognition with multimodal fisher vector, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] João Gama, et al. Human Activity Recognition Using Inertial Sensors in a Smartphone: An Overview, 2019, Sensors.

[16] Lorenzo Torresani, et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2015, IEEE International Conference on Computer Vision (ICCV).

[17] Mubarak Shah, et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012, arXiv.

[18] Mattias Jacobsson, et al. Advocating an ethical memory model for artificial companions from a human-centred perspective, 2011, AI & Society.

[19] Trevor Darrell, et al. Long-term recurrent convolutional networks for visual recognition and description, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Carsten Zoll, et al. The Social Role of Robots in the Future—Explorative Measurement of Hopes and Fears, 2011, International Journal of Social Robotics.

[21] Xi Wang, et al. Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, 2015, ACM Multimedia.

[22] Nasser Kehtarnavaz, et al. A Real-Time Human Action Recognition System Using Depth and Inertial Sensor Fusion, 2016, IEEE Sensors Journal.

[23] Nasser Kehtarnavaz, et al. A survey of depth and inertial sensor fusion for human action recognition, 2015, Multimedia Tools and Applications.

[24] Andrew Zisserman, et al. Convolutional Two-Stream Network Fusion for Video Action Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2015, International Journal of Computer Vision.

[26] Stephen J. McKenna, et al. Combining embedded accelerometers with computer vision for recognizing food preparation activities, 2013, UbiComp.

[27] Daniel Roggen, et al. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition, 2016, Sensors.

[28] Ricardo Chavarriaga, et al. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition, 2013, Pattern Recognition Letters.

[29] Chen Sun, et al. Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames, 2016, ECCV.

[30] Andrew Zisserman, et al. Two-Stream Convolutional Networks for Action Recognition in Videos, 2014, NIPS.

[31] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, arXiv.

[32] Cheng Xu, et al. InnoHAR: A Deep Neural Network for Complex Human Activity Recognition, 2019, IEEE Access.

[33] Horst Bischof, et al. A Duality Based Approach for Realtime TV-L1 Optical Flow, 2007, DAGM Symposium.

[34] Hermann Ney, et al. From Feedforward to Recurrent LSTM Neural Networks for Language Modeling, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[35] Min Tan, et al. Sequential learning for multimodal 3D human activity recognition with Long-Short Term Memory, 2017, IEEE International Conference on Mechatronics and Automation (ICMA).

[36] Yoshua Bengio, et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014, arXiv.

[37] Didier Stricker, et al. Introducing a New Benchmarked Dataset for Activity Monitoring, 2012, International Symposium on Wearable Computers (ISWC).

[38] Jo Ueyama, et al. Exploiting the Use of Convolutional Neural Networks for Localization in Indoor Environments, 2017, Applied Artificial Intelligence.

[39] Gernot A. Fink, et al. Learning Attribute Representation for Human Activity Recognition, 2018, International Conference on Pattern Recognition (ICPR).