Switch and Refine: A Long-Term Tracking and Segmentation Framework

In long-term video object tracking (VOT), most long-term trackers are adapted from short-term trackers and accumulate an increasing number of machine learning modules to improve performance. However, we find empirically that more modules do not necessarily lead to better results. In this paper, we keep the long-term tracking framework simple by carefully selecting cutting-edge trackers. Specifically, we propose a new long-term VOT framework that combines the benefits of two mainstream short-term tracking pipelines, i.e., the discriminative online tracker and the one-shot Siamese tracker, with a global re-detector that is awakened when the target is lost. The framework thus exploits existing advanced work from three complementary perspectives. Experimental results show that, by exploiting the capabilities of existing methods instead of designing new neural networks, we still achieve remarkable results on seven long-term VOT datasets. By introducing a continuously adjustable speed control parameter, our tracker reaches 20+ FPS with only a small performance loss. The refine module not only improves the bounding box estimates but also outputs segmentation masks, so that our framework can handle video object segmentation (VOS) tasks using only VOT trackers. Using only bounding boxes as the initial input, we obtain a trade-off between time and accuracy on two representative VOS datasets.
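
The abstract names the components but not the exact switching rule, so the following is only a minimal sketch of how one per-frame step of such a framework might be organized. It assumes that each short-term tracker reports a confidence score, that the more confident estimate is kept, and that the global re-detector is awakened only when both scores fall below a threshold. The names `Estimate`, `switch_and_refine_step`, and `lost_threshold` are hypothetical and do not come from the paper; the refine step here returns only a box, whereas the paper's refine module also outputs a segmentation mask.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Estimate:
    box: Box
    score: float  # assumed confidence in [0, 1]


# Hypothetical interfaces, chosen for illustration only.
Tracker = Callable[[Any, Box], Estimate]        # (frame, previous box) -> estimate
Detector = Callable[[Any], Optional[Estimate]]  # frame -> estimate, or None if nothing found
Refiner = Callable[[Any, Box], Box]             # (frame, coarse box) -> refined box


def switch_and_refine_step(
    frame: Any,
    prev_box: Box,
    online_tracker: Tracker,
    siamese_tracker: Tracker,
    redetector: Detector,
    refiner: Refiner,
    lost_threshold: float = 0.3,  # assumed value, not taken from the paper
) -> Tuple[Box, bool]:
    """Process one frame and return (box, target_lost)."""
    # Switch: run both short-term trackers and keep the more confident one.
    candidates = [
        online_tracker(frame, prev_box),
        siamese_tracker(frame, prev_box),
    ]
    best = max(candidates, key=lambda e: e.score)

    # Re-detect: search the whole frame only when both trackers are unsure.
    if best.score < lost_threshold:
        found = redetector(frame)
        if found is None or found.score < lost_threshold:
            # Target still lost: keep the last known box and flag it.
            return prev_box, True
        best = found

    # Refine: tighten the selected box before reporting it.
    return refiner(frame, best.box), False
```

With stub components, a call such as `switch_and_refine_step(frame, prev_box, online, siamese, redetector, refiner)` returns the refined box of whichever source was selected, together with a flag indicating whether the target is considered lost in that frame.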
