Action recognition has been an active research area for many years. Extracting discriminative spatial and temporal features of different actions plays a key role in accomplishing this task. Current popular methods of action recognition are mainly based on two-stream Convolutional Networks (ConvNets) or 3D ConvNets. However, two-stream ConvNets are computationally expensive because they require optical flow, while 3D ConvNets consume too much memory because of their large number of parameters. To alleviate these problems, the authors propose a Dual Attention ConvNet (DANet) built on a dual attention mechanism consisting of spatial attention and temporal attention. The former concentrates on the main moving objects in a video frame using a ConvNet structure, and the latter captures relationships across multiple video frames by adopting self-attention. The network is based entirely on 2D ConvNets and takes only RGB frames as input. Experimental results on the UCF-101 and HMDB-51 benchmarks demonstrate that DANet achieves results comparable to leading methods, confirming the effectiveness of the dual attention mechanism.
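The abstract specifies only the high-level design (a 2D-ConvNet backbone, per-frame spatial attention, and self-attention across frames, with RGB input only), so the PyTorch sketch below is a hedged illustration of that idea rather than the authors' implementation; every module name, dimension, and hyperparameter here is an assumption.

```python
# A minimal sketch of the dual attention idea, assuming a generic 2D
# backbone and single-head self-attention; not the paper's architecture.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Conv-based attention map that reweights locations within each
    frame's feature map (hypothetical 1x1-conv configuration)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                      # x: (B*T, C, H, W)
        attn = torch.sigmoid(self.conv(x))     # (B*T, 1, H, W)
        return x * attn                        # emphasize moving objects

class TemporalAttention(nn.Module):
    """Self-attention over per-frame feature vectors to capture
    relationships across frames (single head, for brevity)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, T, D)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(1, 2) / (x.size(-1) ** 0.5)  # (B, T, T)
        return torch.softmax(scores, dim=-1) @ v

class DualAttentionNet(nn.Module):
    """2D ConvNet backbone + per-frame spatial attention, followed by
    temporal self-attention across frames; RGB frames only."""
    def __init__(self, channels: int = 64, num_classes: int = 101):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a 2D ConvNet
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),
        )
        self.spatial = SpatialAttention(channels)
        self.temporal = TemporalAttention(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))  # (B*T, C, 7, 7)
        x = self.spatial(x).mean(dim=(2, 3))     # pool to (B*T, C)
        x = self.temporal(x.view(b, t, -1))      # (B, T, C)
        return self.head(x.mean(dim=1))          # average frames, classify

# Usage: logits = DualAttentionNet()(torch.randn(2, 8, 3, 112, 112))
```

In this reading, spatial attention reweights locations within each frame before pooling, while temporal self-attention lets every frame attend to every other frame; that is what would allow a purely 2D network to model temporal structure without computing optical flow.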