Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

Tracking with bio-inspired event cameras has drawn increasing attention in recent years. Existing works either use aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category incurs a higher inference cost, and the second is easily affected by noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that fully exploits multi-modal / multi-view information during training to facilitate knowledge transfer, enabling high-speed, low-latency visual tracking at test time using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding it the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy, comprising pairwise similarity, feature representation, and response map-based knowledge distillation, to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset, named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, and ping-pong balls. Extensive experiments on both low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset validate the effectiveness of our proposed method. The dataset, evaluation toolkit, and source code are available at \url{https://github.com/Event-AHU/EventVOT_Benchmark}.
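As a rough illustration of how the three distillation terms named above could be combined, the following minimal sketch computes a weighted sum of a pairwise-similarity term, a feature term, and a response-map term between a teacher and a student. The abstract does not specify the loss forms, so everything here is an assumption made for illustration: cosine similarity and mean squared error for the first two terms, a temperature-softened KL divergence for the response maps, equal weights, and all function and parameter names (`hierarchical_kd_loss`, `weights`, `temperature`) are hypothetical. Plain Python lists stand in for network tensors.

```python
import math


def pairwise_similarity(tokens):
    """Cosine similarity between every pair of token feature vectors."""
    # A zero-norm vector falls back to norm 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(v * v for v in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a flattened response map."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hierarchical_kd_loss(t_tokens, s_tokens, t_response, s_response,
                         weights=(1.0, 1.0, 1.0), temperature=2.0):
    """Weighted sum of three distillation terms (all loss forms assumed)."""
    # 1) Pairwise similarity: match token-to-token similarity structure.
    sim = mse(pairwise_similarity(t_tokens), pairwise_similarity(s_tokens))
    # 2) Feature representation: match the token features directly.
    feat = mse(t_tokens, s_tokens)
    # 3) Response map: match softened teacher and student response maps.
    resp = kl_divergence(softmax(t_response, temperature),
                         softmax(s_response, temperature))
    w_sim, w_feat, w_resp = weights
    return w_sim * sim + w_feat * feat + w_resp * resp


# Toy example: two tokens with 2-D features and a 4-cell response map.
teacher_tokens = [[1.0, 0.0], [0.0, 1.0]]
student_tokens = [[0.8, 0.2], [0.1, 0.9]]
teacher_response = [0.1, 2.0, 0.3, 0.0]
student_response = [0.2, 1.5, 0.4, 0.1]
loss = hierarchical_kd_loss(teacher_tokens, student_tokens,
                            teacher_response, student_response)
```

In this sketch the teacher inputs would come from the RGB-plus-event model and the student inputs from the event-only model; only the student would be updated, and the combined value would typically be added to the usual tracking loss.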
