Towards a Fair Evaluation of Zero-Shot Action Recognition Using External Data

Zero-shot action recognition aims to classify actions that were not seen during training. This is achieved by learning a visual model for the seen source classes and establishing a semantic relationship to the unseen target classes, e.g., through the action labels. In order to draw a clear line between zero-shot and conventional supervised classification, the source and target categories must be disjoint. Ensuring that this premise holds is not trivial, especially when the source dataset is external. In this work, we propose an evaluation procedure that enables a fair use of external data for zero-shot action recognition. We empirically show that external sources tend to contain actions excessively similar to the target classes, strongly influencing performance and violating the zero-shot premise. To address this, we propose a corrective method that automatically filters out overly similar categories by exploiting the pairwise intra-dataset similarity of the labels. Our experiments on the HMDB-51 dataset demonstrate that zero-shot models consistently benefit from external sources even under our realistic evaluation, especially when source categories from the internal and external domains are combined.
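
The label-based filtering step can be illustrated with a minimal sketch. The embedding function, the label sets, and the way the cutoff is derived are assumptions for illustration only, not the paper's exact procedure: one plausible instantiation discards every external source class whose label embedding is closer to some target label than the target labels are to one another.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two label embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_external_classes(source_labels, target_labels, embed):
    """Return the external source classes considered safe for zero-shot training.

    `embed` maps an action label to a vector (e.g. a word2vec embedding).
    The cutoff below is derived from the pairwise similarity of the target
    labels themselves; this is only one possible instantiation, not the
    paper's reported method.
    """
    tgt_vecs = {t: embed(t) for t in target_labels}

    # Reference value: the largest similarity observed between any two target labels.
    intra_sims = [cosine(tgt_vecs[a], tgt_vecs[b])
                  for i, a in enumerate(target_labels)
                  for b in target_labels[i + 1:]]
    cutoff = max(intra_sims)

    kept = []
    for src in source_labels:
        v = embed(src)
        # Discard source classes that lie closer to any target class than
        # the target classes lie to each other.
        if max(cosine(v, tv) for tv in tgt_vecs.values()) <= cutoff:
            kept.append(src)
    return kept
```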
