论文信息 - Channel-Temporal Attention for First-Person Video Domain Adaptation

Channel-Temporal Attention for First-Person Video Domain Adaptation

Unsupervised Domain Adaptation (UDA) can transfer knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person action recognition is an under-explored problem, with lack of datasets and limited consideration of first-person video characteristics. This paper focuses on addressing this problem. Firstly, we propose two small-scale first-person video domain adaptation datasets: ADLsmall and GTEAKITCHEN. Secondly, we introduce channel-temporal attention blocks to capture the channel-wise and temporal-wise relationships and model their inter-dependencies important to first-person vision. Finally, we propose a ChannelTemporal Attention Network (CTAN) to integrate these blocks into existing architectures. CTAN outperforms baselines on the two proposed datasets and one existing dataset EPICcvpr20.

Haiping Lu | Shuo Zhou | Xianyuan Liu | Tao Lei

[1] Bolei Zhou,et al. Temporal Relational Reasoning in Videos , 2017, ECCV.

[2] François Laviolette,et al. Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[3] Raoul de Charette,et al. xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Ruxin Chen,et al. Temporal Attentive Alignment for Large-Scale Video Domain Adaptation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[6] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Shih-Fu Chang,et al. ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[8] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[9] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Jessica K. Hodgins,et al. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database , 2008 .

[11] Dima Damen,et al. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[12] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .

[13] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[14] James M. Rehg,et al. Delving into egocentric actions , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Michael I. Jordan,et al. Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[16] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[18] George Trigeorgis,et al. Domain Separation Networks , 2016, NIPS.

[19] C. V. Jawahar,et al. First Person Action Recognition Using Deep Learned Descriptors , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Yunchao Wei,et al. Self-Similarity Grouping: A Simple Unsupervised Cross Domain Adaptation Approach for Person Re-Identification , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21] K. S. Venkatesh,et al. Deep Domain Adaptation in Action Space , 2018, BMVC.

[22] Juan Carlos Niebles,et al. Adversarial Cross-Domain Action Recognition with Co-Attention , 2019, AAAI.

[23] Changick Kim,et al. Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[25] Qilong Wang,et al. Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] D. Damen,et al. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 , 2020, International Journal of Computer Vision.

[27] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[29] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30] Kate Saenko,et al. Return of Frustratingly Easy Domain Adaptation , 2015, AAAI.

[31] James M. Rehg,et al. Learning to recognize objects in egocentric activities , 2011, CVPR 2011.

[32] Limin Wang,et al. Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Larry H. Matthies,et al. First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35] Michael I. Jordan,et al. Conditional Adversarial Domain Adaptation , 2017, NeurIPS.

[36] Mengjie Zhang,et al. Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation , 2016, ECCV.

[37] Deva Ramanan,et al. Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.