Extreme Low Resolution Action Recognition with Spatial-Temporal Multi-Head Self-Attention and Knowledge Distillation

This paper proposes a two-stream network with a novel spatial-temporal multi-head self-attention mechanism for action recognition in extreme low-resolution (LR) videos. The approach first uses a super-resolution (SR) mechanism to provide better visual information and thereby facilitate network training. To obtain more discriminative spatio-temporal features, a knowledge distillation scheme consisting of teacher and student models is employed to enhance the network with knowledge from a high-resolution (HR) model. Moreover, the two-stream network is combined with a new spatial-temporal multi-head self-attention network to learn long-term temporal dependencies effectively. Simulations demonstrate that the proposed method outperforms state-of-the-art methods for extreme LR action recognition on two widely used datasets, HMDB-51 and IXMAS.
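The two generic building blocks named in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the distillation loss is the standard temperature-softened formulation, the attention is standard scaled dot-product multi-head self-attention applied over per-frame feature vectors, and every function name, shape, and default (for example `temperature=4.0`) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions.

    Assumption: the teacher logits come from an HR model and the student
    logits from the LR model; the T^2 scaling follows common practice.
    """
    p = softmax(teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(temperature ** 2 * kl.mean())


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over a sequence of frame features.

    x: (T, D) array, one D-dimensional feature per frame (hypothetical layout).
    w_q, w_k, w_v, w_o: (D, D) projection matrices; D must divide by num_heads.
    Returns the (T, D) output and the (num_heads, T, T) attention weights.
    """
    t, d = x.shape
    d_h = d // num_heads

    def split(m):
        # (T, D) -> (num_heads, T, d_h)
        return m.reshape(t, num_heads, d_h).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (num_heads, T, T)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    heads = weights @ v                               # (num_heads, T, d_h)
    merged = heads.transpose(1, 0, 2).reshape(t, d)   # concatenate the heads
    return merged @ w_o, weights
```

Because every frame attends to every other frame in a single step, the attention weights can relate distant frames directly, which is the usual motivation for using self-attention to capture long-term temporal dependencies.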
