Loss Guided Activation for Action Recognition in Still Images

One significant problem of deep-learning based human action recognition is that it can be easily misled by the presence of irrelevant objects or backgrounds. Existing methods commonly address this problem by employing bounding boxes on the target humans as part of the input, in both training and testing stages. This requirement of bounding boxes as part of the input is needed to enable the methods to ignore irrelevant contexts and extract only human features. However, we consider this solution is inefficient, since the bounding boxes might not be available. Hence, instead of using a person bounding box as an input, we introduce a human-mask loss to automatically guide the activations of the feature maps to the target human who is performing the action, and hence suppress the activations of misleading contexts. We propose a multi-task deep learning method that jointly predicts the human action class and human location heatmap. Extensive experiments demonstrate our approach is more robust compared to the baseline methods under the presence of irrelevant misleading contexts. Our method achieves 94.06% and 40.65% (in terms of mAP) on Stanford40 and MPII dataset respectively, which are 3.14% and 12.6% relative improvements over the best results reported in the literature, and thus set new state-of-the-art results. Additionally, unlike some existing methods, we eliminate the requirement of using a person bounding box as an input during testing.

[1]  Svetlana Lazebnik,et al.  Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering , 2016, ECCV.

[2]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[3]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[6]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Cordelia Schmid,et al.  Expanded Parts Model for Semantic Description of Humans in Still Images , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[9]  Jitendra Malik,et al.  Visual Semantic Role Labeling , 2015, ArXiv.

[10]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[11]  Jianfei Cai,et al.  Action Recognition in Still Images With Minimum Annotation Efforts , 2016, IEEE Transactions on Image Processing.

[12]  Jitendra Malik,et al.  Actions and Attributes from Wholes and Parts , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[13]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[14]  Rama Chellappa,et al.  HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[16]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Minh Hoai,et al.  Regularized Max Pooling for Image Categorization , 2014, BMVC.

[18]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Deva Ramanan,et al.  Attentional Pooling for Action Recognition , 2017, NIPS.

[20]  Yu Zhang,et al.  Exploit Bounding Box Annotations for Multi-Label Object Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Huimin Ma,et al.  Single Image Action Recognition Using Semantic Body Part Actions , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Ian D. Reid,et al.  Towards Context-Aware Interaction Recognition for Visual Relationship Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Fei-Fei Li,et al.  Action Recognition with Exemplar Based 2.5D Graph Matching , 2012, ECCV.

[25]  Deva Ramanan,et al.  Detecting Actions, Poses, and Objects with Relational Phraselets , 2012, ECCV.

[26]  Bernt Schiele,et al.  Fine-Grained Activity Recognition with Holistic and Pose Based Features , 2014, GCPR.

[27]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.