Attention-based spatial-temporal hierarchical ConvLSTM network for action recognition in videos

Human action recognition in videos is an important research topic in computer vision owing to its wide range of applications. Actions naturally contain both spatial and temporal information, so the key to action recognition is modelling the spatial and temporal structures of actions. In this study, the authors propose an attention-based spatial–temporal hierarchical convolutional long short-term memory (ST-HConvLSTM) network to model these structures in both domains. The ST-HConvLSTM consists of two parts: a spatial–temporal attention module and a novel LSTM-like architecture named hierarchical ConvLSTM (HConvLSTM). The HConvLSTM models the spatial and temporal structures of actions, while the spatial–temporal attention module identifies the parts of the video that are most discriminative for action recognition and directs the HConvLSTM's focus toward them. In addition, a weighted fusion strategy is proposed to combine the appearance and motion information of the video. The proposed ST-HConvLSTM is evaluated on the UCF101, HMDB51 and Kinetics datasets. Experimental results show that it achieves state-of-the-art performance compared with other recent LSTM-like architectures and attention-based methods.
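The two mechanisms the abstract describes, spatial attention over a feature map and weighted fusion of the appearance and motion streams, can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the scoring vector `score_w` and the fusion weights `w_app`/`w_mot` are hypothetical placeholders for quantities that the paper's network would learn jointly with the HConvLSTM.

```python
import numpy as np

def softmax(x, axis=None):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(features, score_w):
    """Attend over the spatial locations of an (H, W, C) feature map.

    `score_w` is a hypothetical (C,) projection that assigns a relevance
    score to each location; in the paper this role is played by a learned
    attention module.
    """
    h, w, c = features.shape
    scores = features.reshape(h * w, c) @ score_w       # one score per location
    attn = softmax(scores).reshape(h, w, 1)             # normalised attention map
    return (features * attn).sum(axis=(0, 1))           # attended feature vector

def weighted_fusion(appearance_logits, motion_logits, w_app=0.6, w_mot=0.4):
    """Fuse class scores from the appearance and motion streams.

    The fusion weights here are illustrative assumptions, not values from
    the paper.
    """
    return w_app * softmax(appearance_logits) + w_mot * softmax(motion_logits)

# Example: a uniform 2x2x3 feature map yields uniform attention, and the
# fused output of the two streams remains a valid probability distribution.
feats = np.ones((2, 2, 3))
attended = spatial_attention(feats, np.array([1.0, 0.0, 0.0]))
fused = weighted_fusion(np.array([1.0, 2.0, 0.5]), np.array([2.0, 1.0, 0.5]))
```

Because each stream's softmax output sums to one and the fusion weights sum to one, the fused scores also form a valid distribution over action classes.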
