HUMAN ACTIVITY CLASSIFICATION INCORPORATING EGOCENTRIC VIDEO AND INERTIAL MEASUREMENT UNIT DATA

Many methods have been proposed for human activity classification, relying either on Inertial Measurement Unit (IMU) data or on data from static cameras observing subjects. There has been relatively little work using egocentric videos, and even fewer approaches combining egocentric video and IMU data. Systems relying only on IMU data are limited in the complexity of the activities they can detect. In this paper, we present a robust and autonomous method for fine-grained activity classification that leverages data from multiple wearable sensor modalities to differentiate between activities that are similar in nature, with a level of accuracy that would be unattainable by either sensor alone. We use both egocentric videos and body-worn IMU sensors. We employ Capsule Networks together with Convolutional Long Short-Term Memory (ConvLSTM) to analyze egocentric videos, and an LSTM framework to analyze IMU data, capturing the temporal aspects of actions. We performed experiments on the CMU-MMAC dataset, achieving overall recall and precision rates of 85.8% and 86.2%, respectively. We also present results of using each sensor modality alone, which show that the proposed approach provides a 19.47% and 39.34% increase in accuracy compared to using only ego-vision data and only IMU data, respectively.
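
The abstract describes a two-branch design: a video branch that processes egocentric frames with convolutional recurrence, an IMU branch based on an LSTM, and a fusion step for classification. The sketch below is a minimal, assumed PyTorch rendering of that idea only; the layer sizes, the concatenation-based late fusion, and the use of a plain CNN stand-in for the capsule encoder are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: LSTM gates computed with 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class TwoStreamActivityClassifier(nn.Module):
    """Hypothetical two-branch model: video (CNN features -> ConvLSTM) and
    IMU (LSTM), fused by concatenation before a linear classifier."""
    def __init__(self, num_classes, imu_dim=6, vid_ch=64, hid_ch=64, imu_hid=128):
        super().__init__()
        # Frame-level feature extractor; a stand-in for the capsule-based encoder.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, vid_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(vid_ch, vid_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.convlstm = ConvLSTMCell(vid_ch, hid_ch)
        self.imu_lstm = nn.LSTM(imu_dim, imu_hid, batch_first=True)
        self.classifier = nn.Linear(hid_ch + imu_hid, num_classes)

    def forward(self, video, imu):
        # video: (B, T, 3, H, W); imu: (B, T_imu, imu_dim)
        B, T = video.shape[:2]
        feat0 = self.frame_cnn(video[:, 0])
        h = torch.zeros(B, self.convlstm.hid_ch, *feat0.shape[-2:], device=video.device)
        c = torch.zeros_like(h)
        for t in range(T):
            h, c = self.convlstm(self.frame_cnn(video[:, t]), (h, c))
        vid_emb = h.mean(dim=(2, 3))                    # global average pool over space
        _, (imu_h, _) = self.imu_lstm(imu)
        fused = torch.cat([vid_emb, imu_h[-1]], dim=1)  # late fusion of the two modalities
        return self.classifier(fused)


# Example usage with toy shapes (batch of 2, 8 video frames, 32 IMU samples).
model = TwoStreamActivityClassifier(num_classes=10)
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 32, 6))
print(logits.shape)  # torch.Size([2, 10])
```

Concatenation of the final hidden states is only one plausible fusion choice; score-level or attention-based fusion would fit the same two-branch structure.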
