AVD: Adversarial Video Distillation

In this paper, we present a simple yet efficient approach for video representation, called Adversarial Video Distillation (AVD). The key idea is to represent videos by compressing them in the form of realistic images, which can be used in a variety of video-based scene analysis applications. Representing a video as a single image enables us to address the problem of video analysis by image analysis techniques. To this end, we exploit a 3D convolutional encoder-decoder network to encode the input video as an image by minimizing the reconstruction error. Furthermore, weak supervision by an adversarial training procedure is imposed on the output of the encoder to generate semantically realistic images. The encoder learns to extract semantically meaningful representations from a given input video by mapping the 3D input into a 2D latent representation. The obtained representation can be simply used as the input of deep models pre-trained on images for video classification. We evaluated the effectiveness of our proposed method for video-based activity recognition on three standard and challenging benchmark datasets, i.e. UCF101, HMDB51, and Kinetics. The experimental results demonstrate that AVD achieves interesting performance, outperforming the state-of-the-art methods for video classification.

[1]  Abdenour Hadid,et al.  A Spatiotemporal Convolutional Neural Network for Automatic Pain Intensity Estimation from Facial Dynamics , 2019, International Journal of Computer Vision.

[2]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[3]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[6]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[8]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[10]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[11]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[13]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[14]  Limin Wang,et al.  MoFAP: A Multi-level Representation for Action Recognition , 2015, International Journal of Computer Vision.

[15]  Limin Wang,et al.  Computer Vision and Image Understanding Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice , 2022 .

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Abdenour Hadid,et al.  Deep Discriminative Model for Video Classification , 2018, ECCV.

[19]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[23]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[24]  Lin Sun,et al.  Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Anoop Cherian,et al.  Video Representation Learning Using Discriminative Pooling , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Luc Van Gool,et al.  Temporal 3D ConvNets Using Temporal Transition Layer , 2018, CVPR Workshops.

[27]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[28]  Limin Wang,et al.  Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Limin Wang,et al.  Multi-view Super Vector for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[32]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[33]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[35]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Gang Sun,et al.  A Key Volume Mining Deep Framework for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).