Paper Information - Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

Although various image-based domain adaptation (DA) techniques have been proposed in recent years, domain shift in videos remains underexplored. Most previous works evaluate only on small-scale datasets where performance is already saturated. We therefore first propose two large-scale video DA datasets with much larger domain discrepancy: UCF-HMDB_full and Kinetics-Gameplay. Second, we investigate different DA integration methods for videos and show that simultaneously aligning and learning temporal dynamics achieves effective alignment even without sophisticated DA methods. Finally, we propose the Temporal Attentive Adversarial Adaptation Network (TA3N), which explicitly attends to the temporal dynamics using domain discrepancy for more effective domain alignment, achieving state-of-the-art performance on four video DA datasets (e.g., a 7.9% accuracy gain over “Source only”, from 73.9% to 81.8%, on “HMDB → UCF”, and a 10.3% gain on “Kinetics → Gameplay”). The code and data are released at http://github.com/cmhungsteve/TA3N.
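The idea of attending to temporal dynamics using domain discrepancy can be illustrated with a small sketch. The abstract does not specify the mechanism, so everything below is an assumption made for illustration: each temporal segment is assumed to have a domain classifier output `p` (probability of "source"), the attention weight is assumed to be one plus one minus the normalized entropy of that prediction, and the function names `binary_entropy`, `domain_attention_weights`, and `attend` are hypothetical. Segments whose domain is easy to tell apart (low entropy, hence large discrepancy) receive more weight.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def domain_attention_weights(domain_probs):
    """Map per-segment domain predictions to attention weights in [1, 2].

    Assumption: weight = 1 + (1 - normalized entropy). The leading 1 acts
    as a residual term so that no segment is suppressed entirely.
    """
    max_entropy = math.log(2.0)  # entropy of a 50/50 prediction
    weights = []
    for p in domain_probs:
        normalized_entropy = binary_entropy(p) / max_entropy
        weights.append(1.0 + (1.0 - normalized_entropy))
    return weights


def attend(features, domain_probs):
    """Weighted average of per-segment feature vectors."""
    weights = domain_attention_weights(domain_probs)
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, features)) / total
        for d in range(dim)
    ]


# Two segments: the domain classifier is unsure about the first (p = 0.5)
# and confident about the second (p = 0.99).
segment_features = [[1.0, 0.0], [0.0, 1.0]]
segment_domain_probs = [0.5, 0.99]
pooled = attend(segment_features, segment_domain_probs)
```

In this toy input the second segment dominates the pooled feature because its domain prediction is confident. A real model would produce the domain probabilities from a jointly trained adversarial discriminator rather than from fixed numbers.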
