Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers

This paper addresses the problem of cross-modal object tracking from RGB videos and event data. Rather than constructing a complex cross-modal fusion network, we explore the potential of a pre-trained vision Transformer (ViT). In particular, we investigate plug-and-play training augmentations that encourage the ViT to bridge the vast distribution gap between the two modalities, enabling comprehensive cross-modal information interaction and thus enhancing its tracking ability. Specifically, we propose a mask modeling strategy that randomly masks a specific modality of some tokens, forcing tokens from different modalities to interact proactively. To mitigate the network oscillations resulting from the masking strategy and to further amplify its positive effect, we then propose a theoretically motivated orthogonal high-rank loss to regularize the attention matrix. Extensive experiments demonstrate that our plug-and-play training augmentation techniques significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate. Our new perspective and findings may offer insights into leveraging powerful pre-trained ViTs to model cross-modal data. The code is publicly available at https://github.com/ZHU-Zhiyu/High-Rank_RGB-Event_Tracker.
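The two augmentations described above can be illustrated with a minimal sketch. It is not the released implementation: the function names `mask_modalities` and `orthogonality_penalty`, the `mask_ratio` parameter, masking by zeroing, and the choice of the squared Frobenius distance between the Gram matrix and the identity as the regularizer are all assumptions made here for illustration. The abstract does not specify the exact form of the orthogonal high-rank loss.

```python
import random


def mask_modalities(rgb_tokens, event_tokens, mask_ratio=0.3, rng=None):
    """Zero out one modality (RGB or event) at a random subset of positions.

    rgb_tokens and event_tokens are equal-length lists of feature vectors,
    one pair per spatial token. Zeroing is an assumed masking scheme.
    """
    if rng is None:
        rng = random.Random()
    rgb_out, event_out = [], []
    for rgb, event in zip(rgb_tokens, event_tokens):
        if rng.random() < mask_ratio:
            # Mask exactly one modality here, so the surviving modality
            # has to supply the information for this token.
            if rng.random() < 0.5:
                rgb = [0.0] * len(rgb)
            else:
                event = [0.0] * len(event)
        rgb_out.append(list(rgb))
        event_out.append(list(event))
    return rgb_out, event_out


def orthogonality_penalty(attn):
    """Squared Frobenius norm of (A A^T - I) for a matrix A given as rows.

    The penalty is zero when the rows of A are orthonormal (full row rank)
    and grows as rows become similar, as in a collapsed, low-rank attention
    map. This is a generic surrogate, not necessarily the paper's loss.
    """
    n = len(attn)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gram = sum(a * b for a, b in zip(attn[i], attn[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total
```

In a training loop, a penalty of this kind would typically be added to the tracking loss with a small weight and applied to the attention block that relates tokens of one modality to tokens of the other; the weight and the choice of block are likewise assumptions here.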
