Mamba-FETrack: Frame-Event Tracking via State Space Model

RGB-Event based tracking is an emerging research topic that focuses on how to effectively integrate heterogeneous multi-modal data (synchronized exposure video frames and asynchronous pulse Event streams). Existing works typically employ Transformer-based networks to handle these modalities and achieve decent accuracy on multiple datasets through input-level or feature-level fusion. However, these trackers incur high memory consumption and computational complexity due to the self-attention mechanism. This paper proposes a novel RGB-Event tracking framework, Mamba-FETrack, based on the State Space Model (SSM), to achieve high-performance tracking while effectively reducing computational cost. Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams. We then boost the interactive learning between the RGB and Event features using the Mamba network. The fused features are fed into the tracking head for target object localization. Extensive experiments on the FELT and FE108 datasets validate the efficiency and effectiveness of our proposed tracker. Specifically, our Mamba-based tracker achieves 43.5/55.6 on the SR/PR metrics, while the ViT-S based tracker (OSTrack) obtains 40.0/50.9. The GPU memory costs of our tracker and the ViT-S based tracker are 13.98GB and 15.44GB, respectively, a reduction of about $9.5\%$. The FLOPs and parameters of ours/ViT-S based OSTrack are 59G/1076G and 7M/60M, reductions of about $94.5\%$ and $88.3\%$, respectively. We hope this work can bring new insights to the tracking field and promote the application of the Mamba architecture in tracking. The source code of this work will be released at \url{https://github.com/Event-AHU/Mamba_FETrack}.
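
The percentage reductions quoted above follow directly from the paired figures in the abstract. The short check below is only an illustrative sketch: the helper `relative_reduction` and the dictionary keys are names chosen here, not part of the released code, and the inputs are simply the numbers reported in the abstract.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Return the percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# Figures as reported in the abstract: (Mamba-FETrack, ViT-S based OSTrack).
# The key names are illustrative labels, not identifiers from the paper.
reported = {
    "gpu_memory": (13.98, 15.44),
    "flops": (59.0, 1076.0),
    "parameters": (7.0, 60.0),
}

# Round to one decimal place to match the precision used in the abstract.
reductions = {
    name: round(relative_reduction(ours, baseline), 1)
    for name, (ours, baseline) in reported.items()
}

for name, pct in reductions.items():
    print(f"{name}: about {pct}% lower than the ViT-S based baseline")
```

Because each reduction is a ratio of two quantities in the same unit, the result does not depend on the unit in which the pair is expressed.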
