A joint model for action localization and classification in untrimmed video with visual attention

In this paper, we introduce a joint model that learns to directly localize the temporal bounds of actions in untrimmed videos and to precisely classify which actions occur. Most existing approaches scan the entire video to generate action instances, which is inefficient. Instead, inspired by human perception, our model is formulated as a recurrent neural network that observes different locations within a video over time. It produces temporal localizations after observing only a fixed number of fragments, so the amount of computation it performs is independent of the input video length. The decision policy for determining where to look next is learned with REINFORCE, which is effective in non-differentiable settings. In addition, unlike related approaches, our model performs localization and classification serially and includes a strategy for extracting appropriate features for classification. We evaluate our model on the ActivityNet dataset, where it greatly outperforms the baseline. Moreover, compared with a recent approach, we show that our serial design yields roughly a 9% improvement in detection performance.
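
The sketch below illustrates the kind of recurrent glimpse model with a REINFORCE-trained observation policy that the abstract describes. It is a minimal illustration only: the fragment encoder, feature dimensions, hidden size, number of glimpses, and the Gaussian location policy are hypothetical choices, not details given in the paper.

```python
# Minimal sketch of a recurrent attention model over video fragments.
# Assumptions (not from the paper): pre-extracted per-fragment features,
# a GRU core, a Gaussian policy over the next observation location, and
# hyperparameters chosen only for illustration.
import torch
import torch.nn as nn


class RecurrentGlimpseLocalizer(nn.Module):
    """Observes a fixed number of video fragments, then emits temporal
    bounds (start, end) and action class logits."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=200, num_glimpses=6):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.locate = nn.Linear(hidden_dim, 1)    # where to look next (normalized time)
        self.bounds = nn.Linear(hidden_dim, 2)    # predicted (start, end), normalized
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, fragment_features):
        # fragment_features: (batch, num_fragments, feat_dim), e.g. C3D-style features
        batch, n_frag, _ = fragment_features.shape
        h = fragment_features.new_zeros(batch, self.rnn.hidden_size)
        log_probs = []                                     # needed for REINFORCE
        loc = fragment_features.new_full((batch,), 0.5)    # start at the video midpoint
        for _ in range(self.num_glimpses):
            idx = (loc * (n_frag - 1)).round().long().clamp(0, n_frag - 1)
            glimpse = fragment_features[torch.arange(batch), idx]
            h = self.rnn(glimpse, h)
            # stochastic policy over the next observation location
            mean = torch.sigmoid(self.locate(h)).squeeze(1)
            dist = torch.distributions.Normal(mean, 0.1)
            loc = dist.sample().clamp(0.0, 1.0)
            log_probs.append(dist.log_prob(loc))
        start_end = torch.sigmoid(self.bounds(h))          # temporal localization
        class_logits = self.classify(h)
        return start_end, class_logits, torch.stack(log_probs, dim=1)


def reinforce_loss(log_probs, reward):
    # REINFORCE: maximize expected reward (e.g. overlap with ground truth),
    # i.e. minimize -E[R * log pi(a)].
    return -(reward.unsqueeze(1) * log_probs).mean()
```

Because the number of glimpses is fixed, the forward pass costs the same regardless of how many fragments the input video contains, which is the efficiency property the abstract emphasizes.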
