Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection

We propose a novel domain-adaptive action detection approach and a new adaptation protocol that leverage recent advances in image-level unsupervised domain adaptation (UDA) techniques and handle the vagaries of instance-level video data. Self-training combined with cross-domain mixed sampling has shown remarkable performance gains for semantic segmentation in the UDA setting. Motivated by this, we propose an approach for human action detection in videos that transfers knowledge from the source domain (annotated dataset) to the target domain (unannotated dataset) using mixed sampling and pseudo-label-based self-training. Existing UDA techniques follow the ClassMix algorithm for semantic segmentation. However, simply adopting ClassMix for action detection does not work, mainly because these are two entirely different problems, i.e., pixel-label classification vs. instance-label detection. To tackle this, we propose a novel action instance mixed sampling technique that combines information across domains based on action instances instead of action classes. Moreover, we propose a new UDA training protocol that addresses the long-tail sample distribution and domain shift problems by using supervision from an auxiliary source domain (ASD). For the ASD, we propose a new action detection dataset with dense frame-level annotations. We name our proposed framework domain-adaptive action instance mixing (DA-AIM). We demonstrate that DA-AIM consistently outperforms prior works on challenging domain adaptation benchmarks. The source code is available at https://github.com/wwwfan628/DA-AIM.
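To make the idea of mixing by action instances rather than by classes more concrete, the following is a minimal NumPy sketch and not the released DA-AIM implementation. It assumes each source instance is given as a single pixel-coordinate box that holds for every frame of the clip, and it borrows ClassMix's "pick half" rule for choosing which instances to paste; the function name `mix_action_instances` and its arguments are hypothetical.

```python
import numpy as np


def mix_action_instances(src_clip, src_boxes, tgt_clip, rng=None):
    """Paste a random half of the source action instances onto a target clip.

    src_clip, tgt_clip: arrays of shape (T, H, W, C) with identical shapes.
    src_boxes: int array of shape (N, 4) holding (x1, y1, x2, y2) in pixels,
               assumed (for this sketch) to be constant across the T frames.
    Returns the mixed clip, the boolean paste mask (H, W), and the pasted boxes.
    """
    rng = np.random.default_rng() if rng is None else rng
    src_boxes = np.asarray(src_boxes, dtype=int).reshape(-1, 4)
    n = len(src_boxes)

    # Assumption: select ceil(N / 2) instances, analogous to ClassMix
    # selecting half of the classes present in a source image.
    k = (n + 1) // 2
    chosen = rng.choice(n, size=k, replace=False) if n > 0 else np.array([], dtype=int)

    # Build a spatial mask that is True inside every selected instance box.
    mask = np.zeros(src_clip.shape[1:3], dtype=bool)
    for x1, y1, x2, y2 in src_boxes[chosen]:
        mask[y1:y2, x1:x2] = True

    # Broadcast the mask over time and channels: source pixels inside the
    # boxes, target pixels everywhere else.
    mixed = np.where(mask[None, :, :, None], src_clip, tgt_clip)
    return mixed, mask, src_boxes[chosen]
```

In a self-training loop of the kind the abstract describes, the pasted boxes would keep their source ground-truth labels, while the remaining regions of the mixed clip would be supervised with pseudo-labels predicted on the target clip.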
