A Study on Action Detection in the Wild

The recent introduction of the AVA dataset for action detection has renewed interest in this problem. Several approaches have recently been proposed that improve performance. However, all of them ignore the main difficulty of the AVA dataset: its realistic distribution of training and test examples. The dataset was collected by exhaustively annotating human actions in uncurated videos. As a result, the most common categories, such as 'stand' or 'sit', contain tens of thousands of examples, whereas rare ones have only dozens. In this work we study the problem of action detection on a highly imbalanced dataset. Unlike previous work on handling long-tail category distributions, we begin by analyzing the imbalance in the test set. We demonstrate that the standard AP metric is not informative for categories in the tail, and propose an alternative, Sampled AP. Armed with this new measure, we study the problem of transferring representations from the data-rich head to the rare tail categories and propose a simple but effective approach.
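To make the metric discussion concrete, the sketch below computes standard per-class average precision, plus one plausible form a "Sampled AP" could take: subsampling negatives to a fixed ratio per positive before scoring, averaged over random trials. The `sampled_ap` definition, its `neg_per_pos` and `trials` parameters, and the subsampling scheme are illustrative assumptions, not the paper's exact formulation.

```python
import random

def average_precision(scores, labels):
    """Standard AP: rank detections by decreasing score and average
    the precision values measured at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def sampled_ap(scores, labels, neg_per_pos=3, trials=100, seed=0):
    """Hypothetical 'Sampled AP' sketch (an assumption, not the paper's
    definition): cap negatives at a fixed multiple of the positives so
    tail classes are not swamped, and average AP over random draws."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    k = min(len(neg), neg_per_pos * len(pos))
    total = 0.0
    for _ in range(trials):
        sub = pos + rng.sample(neg, k)
        total += average_precision([scores[i] for i in sub],
                                   [labels[i] for i in sub])
    return total / trials
```

The point of capping negatives is that for a rare class, millions of background boxes dominate the ranking, so tiny score perturbations swamp the handful of positives; fixing the positive-to-negative ratio makes AP values comparable across head and tail classes.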
