Better deep visual attention with reinforcement learning in action recognition

Deep visual attention has attracted considerable interest in computer vision in recent years and has contributed substantially to image classification, image captioning, and action recognition. However, because existing models are trained wholly or partially by backpropagation (BP), they do not realize the full potential of attention in terms of computational efficiency and focusing accuracy. Our intuition is that an attention mechanism should resemble the way humans direct their attention and select the next location to focus on: by observing, analyzing, and jumping, rather than by describing continuous features as existing methods do. Based on this insight, we formulate our model as a recurrent neural network-based agent that chooses an attention region at each timestep by reinforcement learning. In experiments, our model outperforms the baselines in both focusing and recognition accuracy while consuming far fewer computational resources, and can therefore be regarded as a better form of deep visual attention.
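As a rough illustration of choosing an attention location by reinforcement learning, the following minimal sketch applies a REINFORCE (score-function) update to a Gaussian location policy on a one-dimensional toy problem. It is not the model described above: there is no recurrent network and no video input, and the names `reinforce_gradient` and `train_location_policy`, the reward function, and all hyperparameters are assumptions made purely for illustration.

```python
import random


def reinforce_gradient(location, mean, sigma, reward, baseline):
    """Score-function estimate of d(expected reward)/d(mean) for one sample.

    For a Gaussian policy N(mean, sigma^2), the gradient of the log-density
    with respect to the mean is (location - mean) / sigma^2.
    """
    return (location - mean) / sigma ** 2 * (reward - baseline)


def train_location_policy(target=0.6, sigma=0.2, lr=0.01, steps=5000, seed=0):
    """Learn where to look on a 1-D axis (hypothetical toy task)."""
    rng = random.Random(seed)
    mean = 0.0      # policy parameter: centre of the attended location
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        # Sample a discrete "jump" to a location instead of a soft weighting.
        location = rng.gauss(mean, sigma)
        # Assumed reward: higher when the sampled location is near the target.
        reward = -(location - target) ** 2
        # Gradient ascent on expected reward; no gradient flows through sampling.
        mean += lr * reinforce_gradient(location, mean, sigma, reward, baseline)
        baseline += 0.05 * (reward - baseline)
    return mean


if __name__ == "__main__":
    print(train_location_policy())
```

In this toy setting the sampled location is a hard choice, so the update relies only on the observed reward rather than on backpropagating through the selection step. In a full model the policy mean would presumably be produced by the recurrent network's hidden state at each timestep.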
