Dual attention convolutional network for action recognition

Action recognition has been an active research area for many years. Extracting discriminative spatial and temporal features of different actions plays a key role in accomplishing this task. Current popular methods are mainly based on two-stream Convolutional Networks (ConvNets) or 3D ConvNets. However, two-stream ConvNets are computationally expensive because they require optical flow, while 3D ConvNets consume too much memory because they have a large number of parameters. To alleviate these problems, the authors propose a Dual Attention ConvNet (DANet) built on a dual attention mechanism consisting of spatial attention and temporal attention. The former concentrates on the main moving objects in a video frame using a ConvNet structure, while the latter captures related information across multiple video frames through self-attention. The network is built entirely on 2D ConvNets and takes only RGB frames as input. Experimental results on the UCF-101 and HMDB-51 benchmarks demonstrate that DANet achieves results comparable to leading methods, which supports the effectiveness of the dual attention mechanism.
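
The paper's exact architecture is not reproduced here, but a minimal sketch of a dual attention block along these lines might look as follows. This is PyTorch-style illustration only: all module names, feature shapes, the sigmoid gating, and the way the two attentions are composed are assumptions, not the authors' specification, and the 2D backbone that would normally produce the per-frame features is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Per-frame attention map over spatial locations (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        # Small ConvNet producing a single-channel attention map per frame
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels // 8, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 8, 1, kernel_size=1),
        )

    def forward(self, x):                    # x: (batch*frames, C, H, W)
        attn = torch.sigmoid(self.conv(x))   # (batch*frames, 1, H, W)
        return x * attn                      # reweight spatial locations

class TemporalAttention(nn.Module):
    """Self-attention across frame-level features (assumed design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, frames, dim)
        out, _ = self.attn(x, x, x)          # relate every frame to every other
        return out

class DualAttentionNet(nn.Module):
    """Sketch of a DANet-like pipeline operating on RGB frames only."""
    def __init__(self, channels=512, num_classes=101):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.temporal = TemporalAttention(channels)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):                    # x: (batch, frames, C, H, W)
        b, t, c, h, w = x.shape
        x = self.spatial(x.view(b * t, c, h, w))  # 2D spatial attention
        x = x.mean(dim=(2, 3)).view(b, t, c)      # pool to frame features
        x = self.temporal(x)                      # self-attention over time
        return self.fc(x.mean(dim=1))             # average frames, classify
```

Because every operation here is 2D convolution or attention over frame features, such a design avoids both the optical-flow preprocessing of two-stream methods and the parameter growth of 3D convolutions, which is the trade-off the abstract highlights.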
