Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques that improve their cross-domain performance remains unexplored. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone brings only limited improvements, as it does not guarantee that the sequence features used for prediction in the transformer are domain-invariant. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequences of both domains. When deployed in the transformer encoder and decoder, DQFA reduces the domain discrepancy in global feature representations and in object relations, respectively. Meanwhile, TDA aligns the token features in the sequences from both domains, which reduces the domain gap in local and instance-level feature representations in the transformer encoder and decoder, respectively. In addition, a novel bipartite matching consistency loss is proposed to enhance feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code is available at: https://github.com/encounter1997/SFA.
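To make the aggregation step in DQFA concrete, the following is a minimal sketch of how a single query vector can pool global context from a token sequence with scaled dot-product attention. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `aggregate_with_domain_query`, the single-head formulation, and the toy two-dimensional tokens are all hypothetical, and the alignment step that would operate on the pooled vector is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_with_domain_query(domain_query, tokens):
    """Pool a token sequence into one global vector (hypothetical helper).

    Assumption: plain single-head scaled dot-product attention, with the
    domain query as the only query and the tokens as both keys and values.
    """
    d = len(domain_query)
    # Similarity between the domain query and every token in the sequence.
    scores = [
        sum(q * t for q, t in zip(domain_query, tok)) / math.sqrt(d)
        for tok in tokens
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the tokens gives the global context vector.
    context = [
        sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)
    ]
    return context, weights


# Toy sequence of three 2-dimensional tokens and an all-zero query,
# which attends to every token equally.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
domain_query = [0.0, 0.0]
context, weights = aggregate_with_domain_query(domain_query, tokens)
```

In this sketch the pooled `context` vector summarizes the whole sequence, which is the kind of global representation a domain-level alignment objective would act on. TDA, by contrast, would operate on each entry of `tokens` individually.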
