Actionness-pooled Deep-convolutional Descriptor for fine-grained action recognition

Abstract: General action recognition has seen great success in recent years. However, existing general action representations struggle to recognize fine-grained actions, which usually share high similarity in both appearance and motion pattern. To solve this problem, we introduce a visual attention mechanism into the proposed descriptor, termed Actionness-pooled Deep-convolutional Descriptor (ADD). Instead of pooling features uniformly from the entire video, we aggregate features in sub-regions that are more likely to contain actions according to actionness maps. This gives ADD a superior capability to capture the subtle differences between fine-grained actions. We conduct experiments on the HIT Dances dataset, one of the few existing datasets for fine-grained action analysis. Quantitative results demonstrate that ADD remarkably outperforms traditional CNN-based representations. Extensive experiments on two general action benchmarks, JHMDB and UCF101, additionally show that combining ADD with an end-to-end ConvNet can further boost recognition performance. Besides, taking advantage of ADD, we reveal the sparsity characteristic of actions and point out a potential direction for designing more effective action analysis models by extracting both representative and discriminative action patterns.
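The contrast the abstract draws between uniform pooling and actionness-guided pooling can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `uniform_pool` and `actionness_pool`, the weighted-average formulation, and the toy numbers are all assumptions made for illustration. The abstract only states that features are aggregated in sub-regions that actionness maps mark as likely to contain actions.

```python
# Hypothetical sketch: weighted average pooling guided by actionness scores.
# The abstract does not specify the exact pooling rule; this is one simple
# reading of "aggregate features where actionness is high".

def uniform_pool(features):
    """Average feature vectors over all positions with equal weight."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[c] for f in features) / n for c in range(dim)]


def actionness_pool(features, actionness, eps=1e-8):
    """Average feature vectors, weighting each position by its actionness."""
    if len(features) != len(actionness):
        raise ValueError("features and actionness must have the same length")
    total = sum(actionness)
    if total <= eps:
        # No position looks like an action: fall back to uniform pooling.
        return uniform_pool(features)
    dim = len(features[0])
    return [
        sum(a * f[c] for f, a in zip(features, actionness)) / total
        for c in range(dim)
    ]


# Toy example: four positions with 2-D features. Only the first position
# has a high (assumed) actionness score.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
actionness = [0.9, 0.1, 0.0, 0.0]

uniform = uniform_pool(features)
pooled = actionness_pool(features, actionness)
print(uniform, pooled)
```

In this toy case the uniform average is dominated by the three background-like positions, while the actionness-weighted average is dominated by the single high-actionness position. That is the intuition behind pooling from action-relevant sub-regions rather than from the whole video.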
