Using External Knowledge to Improve Zero-Shot Action Recognition in Egocentric Videos

Zero-shot learning is a promising research topic. In a vision-based action recognition system, for instance, it allows actions never seen during the training phase to be recognised. Previous work on zero-shot action recognition has exploited the visual appearance of input videos in several ways to infer actions. Here, we propose adding external knowledge to improve the performance of purely vision-based systems. Specifically, we explore three different sources of knowledge in the form of text corpora. Following the literature, our system disentangles actions into verbs and objects. In particular, we independently train two vision-based detectors: (i) a verb detector and (ii) an active object detector. During inference, we combine the probability distributions produced by these detectors to obtain a probability distribution over actions. Finally, the vision-based estimate is further combined with an action prior extracted from the text corpora (the external knowledge). We evaluate our approach on EGTEA Gaze+, an egocentric action recognition dataset, and show that the use of external knowledge improves the recognition of actions never seen by the detectors.
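To make the fusion step concrete, the following is a minimal sketch, assuming actions are (verb, object) pairs, that both detectors output softmax probabilities, and that the detector outputs and corpus prior are combined with a simple product rule followed by renormalisation. The function name, toy classes, and the exact combination rule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def fuse_action_scores(p_verb, p_obj, action_prior, action_pairs):
    """Combine verb/object probabilities with a text-corpus action prior.

    p_verb:       dict verb -> probability from the verb detector
    p_obj:        dict object -> probability from the active-object detector
    action_prior: dict (verb, object) -> prior probability mined from text corpora
    action_pairs: list of candidate (verb, object) actions
    """
    scores = np.array([
        p_verb[v] * p_obj[o] * action_prior.get((v, o), 1e-6)
        for v, o in action_pairs
    ])
    scores /= scores.sum()  # renormalise into a distribution over actions
    return dict(zip(action_pairs, scores))

# Toy usage with hypothetical verb/object classes and prior values
p_verb = {"cut": 0.7, "open": 0.3}
p_obj = {"tomato": 0.6, "fridge": 0.4}
prior = {("cut", "tomato"): 0.05, ("open", "fridge"): 0.04,
         ("cut", "fridge"): 0.001, ("open", "tomato"): 0.002}
pairs = list(prior.keys())
print(fuse_action_scores(p_verb, p_obj, prior, pairs))
```

In this sketch the corpus prior down-weights implausible verb-object combinations (e.g. "cut fridge") even when the individual detectors assign them non-trivial probability, which is the intuition behind adding external knowledge to the purely visual estimate.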
