Attentional Fused Temporal Transformation Network for Video Action Recognition

Effective spatiotemporal feature representation is crucial to the video-based action recognition task. Focusing on discriminate spatiotemporal feature learning, we propose Attentional Fused Temporal Transformation Network (AttnTTN) for action recognition on top of popular Temporal Segment Network (TSN) framework. In the network, Attentional Fusion Module (AttnFM) is designed to fuse the appearance and motion features at multiple ConvNet levels for each video snippet, forming a short-term video descriptor. With fused features as inputs, Temporal Transformation Networks (TTN) are employed to model middle-term temporal transformation between the neighboring temporal snippets following a sequential order. AttnTTN achieves the state-of-the-art results on two most popular action recognition datasets: UCF101 and HMDB51.

[1]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[2]  Nicu Sebe,et al.  Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yong Dou,et al.  Exploring Temporal Preservation Networks for Precise Temporal Action Localization , 2017, AAAI.

[8]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[10]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[12]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[14]  Yong Dou,et al.  Exploring Frame Segmentation Networks for Temporal Action Localization , 2019, J. Vis. Commun. Image Represent..

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Yong Dou,et al.  Towards Precise End-to-End Weakly Supervised Object Detection Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Philip S. Yu,et al.  Spatiotemporal Pyramid Network for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[21]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[23]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[24]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[25]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.