VideoDG: Generalizing Temporal Relations in Videos to Novel Domains

This paper introduces video domain generalization, a setting in which most video classification networks degrade because they are never exposed to target domains with divergent distributions. We observe that global temporal features generalize poorly: under temporal domain shift, videos from unseen domains may exhibit unexpected absences or misalignments of temporal relations. This finding motivates us to tackle video domain generalization by learning local-relation features at multiple timescales, which generalize better, and combining them with global-relation features to preserve discriminability. This paper presents the VideoDG framework with two technical contributions. The first is a new deep architecture, the Adversarial Pyramid Network, which improves the generalizability of video features by progressively capturing local-relation, global-relation, and cross-relation features. Building on these pyramid features, the second contribution is a new, robust adversarial data augmentation approach that bridges different video domains by improving both the diversity and quality of the augmented data. We construct three video domain generalization benchmarks in which domains are divided according to different datasets, different consequences of actions, or different camera views, respectively. VideoDG consistently outperforms combinations of previous video classification models and existing domain generalization methods on all benchmarks.
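The second contribution above rests on a general idea: synthesize "hard" training examples by perturbing inputs in the direction that increases the classification loss, so the model sees a proxy for unseen domains during training. The sketch below illustrates this idea on a toy linear-softmax classifier in NumPy; it is not VideoDG's actual implementation (which operates on pyramid features of a deep network), and the function and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, onehot):
    return -np.sum(onehot * np.log(probs + 1e-12), axis=-1)

def adversarial_augment(x, y_onehot, W, b, step=0.05):
    """One gradient-ascent step on the input of a linear-softmax model.

    The returned example has a (locally) higher classification loss,
    acting as a stand-in for data from an unseen domain. Toy sketch only.
    """
    probs = softmax(x @ W + b)
    grad_x = (probs - y_onehot) @ W.T            # dL/dx for softmax + CE
    direction = grad_x / (np.linalg.norm(grad_x) + 1e-8)
    return x + step * direction

# Usage: augment one example and verify the loss went up.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
b = np.zeros(3)
x = rng.normal(size=(1, 8))
y = np.eye(3)[[1]]                               # true class = 1
x_adv = adversarial_augment(x, y, W, b)
```

In the full framework, such augmented examples are mixed into training so the classifier is optimized against distributional perturbations rather than only the source distribution.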
