Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

Most existing RGB-based trackers target low frame rate benchmarks of around 30 frames per second. This setting limits the trackers' usefulness in real-world scenarios, especially under fast motion. Event-based cameras, as bio-inspired sensors, offer considerable potential for high frame rate tracking due to their high temporal resolution. However, event-based cameras cannot provide the fine-grained texture information that conventional cameras capture. This complementarity motivates us to combine conventional frames and events for high frame rate object tracking under various challenging conditions. In this paper, we propose an end-to-end network consisting of multi-modality alignment and fusion modules that effectively combine meaningful information from both modalities, which are measured at different rates. The alignment module performs cross-style and cross-frame-rate alignment between the frame and event modalities, guided by the motion cues that events provide. The fusion module, in turn, emphasizes valuable features and suppresses noise through the mutual complementarity of the two modalities. Extensive experiments show that the proposed approach outperforms state-of-the-art trackers by a significant margin in high frame rate tracking. On the FE240hz dataset, our approach achieves high frame rate tracking of up to 240 Hz.
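The abstract does not specify the event representation or the fusion operator. The following is only a rough sketch under our own assumptions, not the paper's method. It illustrates two of the ideas above. First, the cross-frame-rate setting: the events between two consecutive frames are cut into equal-duration slices, so that each slice, paired with the latest frame, can yield one tracker output. Second, a generic softmax gate that lets the stronger of two modality responses dominate each channel. Several choices here are assumptions: events as `(t, x, y, p)` tuples, a 2-channel polarity count map per slice, a 30 Hz frame rate, and the softmax gate itself. The names `bins_per_interval`, `slice_events`, and `gated_fusion` are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical event format: (timestamp in seconds, x, y, polarity in {-1, +1}).
Event = Tuple[float, int, int, int]


def bins_per_interval(frame_hz: float, output_hz: float) -> int:
    """Number of event slices per frame interval needed to reach output_hz."""
    return int(round(output_hz / frame_hz))


def slice_events(events: List[Event], t_start: float, t_end: float,
                 num_bins: int, height: int, width: int) -> List[List[List[List[int]]]]:
    """Split events in [t_start, t_end) into num_bins equal-duration slices.

    Each slice is a 2 x height x width count map (assumed representation):
    channel 0 counts positive events, channel 1 counts negative events.
    """
    duration = (t_end - t_start) / num_bins
    bins = [[[[0] * width for _ in range(height)] for _ in range(2)]
            for _ in range(num_bins)]
    for t, x, y, p in events:
        if not (t_start <= t < t_end):
            continue  # event belongs to a different frame interval
        k = min(int((t - t_start) / duration), num_bins - 1)
        channel = 0 if p > 0 else 1
        bins[k][channel][y][x] += 1
    return bins


def gated_fusion(frame_feat: List[float], event_feat: List[float]) -> List[float]:
    """Toy per-channel fusion of two modalities with a 2-way softmax gate.

    The two responses of each channel compete through a softmax, so the
    stronger one is emphasized and the weaker one is down-weighted.
    """
    fused = []
    for f, e in zip(frame_feat, event_feat):
        m = max(f, e)  # subtract the max for numerical stability
        wf, we = math.exp(f - m), math.exp(e - m)
        fused.append((wf * f + we * e) / (wf + we))
    return fused


if __name__ == "__main__":
    n = bins_per_interval(30, 240)  # assumed 30 Hz frames -> 8 slices per interval
    events = [(0.001, 0, 0, 1), (0.010, 1, 0, -1), (0.030, 1, 1, 1)]
    slices = slice_events(events, 0.0, 1 / 30, n, height=2, width=2)
    print(n, slices[2][1][0][1])  # the negative event lands in slice 2
    print(gated_fusion([2.0, 0.0], [0.0, 0.0]))
```

With an assumed 30 Hz frame rate, reaching a 240 Hz output means 240 / 30 = 8 event slices per frame interval. Each slice spans 1/240 s, so the tracker can be queried eight times between two frames.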
