Unsupervised Cross-Modal Distillation for Thermal Infrared Tracking

The target representation learned by convolutional neural networks plays an important role in Thermal Infrared (TIR) tracking. Most top-performing TIR trackers still employ representations learned by models trained on RGB data. Such representations do not take into account the information in the TIR modality itself, which limits TIR tracking performance. To solve this problem, we propose to distill representations of the TIR modality from the RGB modality with Cross-Modal Distillation (CMD) on a large amount of unlabeled paired RGB-TIR data. We take advantage of the two-branch architecture of the baseline tracker, DiMP, and apply cross-modal distillation to two components of the tracker. Specifically, we use one branch as a teacher module and distill the representation it has learned into the other branch. Benefiting from the powerful model in the RGB modality, cross-modal distillation learns a TIR-specific representation that improves TIR tracking. The proposed approach can be conveniently incorporated into different baseline trackers as a generic and independent component. Furthermore, the semantic coherence of paired RGB and TIR images is used as the supervision signal in the distillation loss for cross-modal knowledge transfer. In practice, we explore three different approaches to generating paired RGB-TIR patches with the same semantics, so that training is unsupervised and extends easily to an even larger scale of unlabeled training data. Extensive experiments on the LSOTB-TIR and PTB-TIR datasets demonstrate that our proposed cross-modal distillation method effectively learns TIR-specific target representations transferred from the RGB modality. Our tracker outperforms the baseline tracker with absolute gains of 2.3% in Success, 2.7% in Precision, and 2.5% in Normalized Precision. Code and models are available at https://github.com/zhanglichao/cmdTIRtracking.
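The teacher-student idea described above can be illustrated with a toy sketch. It is not the released implementation: the linear maps stand in for the RGB (teacher) and TIR (student) branches, the mean-squared-error feature-matching loss is an assumption (the abstract does not specify the form of the distillation loss), and all names (`distill_step`, `teacher_w`, `student_w`) as well as the synthetic RGB-TIR pairing are hypothetical. The point it shows is that the only training signal is the frozen teacher's output on the paired RGB patch, so no tracking labels are needed.

```python
import random


def linear(weights, x):
    """Apply a linear map (a list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill_step(student_w, teacher_w, rgb_patch, tir_patch, lr=0.5):
    """One SGD step pulling student(TIR) towards the frozen teacher(RGB).

    Assumption: an MSE feature-matching loss. Returns the loss measured
    before the update. Only student_w is modified.
    """
    target = linear(teacher_w, rgb_patch)  # teacher branch, never updated
    pred = linear(student_w, tir_patch)    # student branch on the TIR patch
    n = len(pred)
    for i in range(n):
        grad_out = 2.0 * (pred[i] - target[i]) / n  # d(MSE)/d(pred[i])
        for j, x_j in enumerate(tir_patch):
            student_w[i][j] -= lr * grad_out * x_j
    return mse(pred, target)


def mean_loss(student_w, teacher_w, pairs):
    """Average distillation loss over a set of (rgb, tir) pairs."""
    total = sum(
        mse(linear(student_w, tir), linear(teacher_w, rgb)) for rgb, tir in pairs
    )
    return total / len(pairs)


rng = random.Random(0)
teacher_w = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0]]  # "pretrained" RGB branch
student_w = [[0.0] * 4 for _ in range(2)]                      # untrained TIR branch

# Toy stand-in for paired, semantically aligned patches: the "TIR" vector is
# a fixed transform of the "RGB" vector (hypothetical, for illustration only).
pairs = []
for _ in range(16):
    rgb = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    tir = [0.5 * v for v in reversed(rgb)]
    pairs.append((rgb, tir))

initial_loss = mean_loss(student_w, teacher_w, pairs)
for _ in range(200):
    for rgb, tir in pairs:
        distill_step(student_w, teacher_w, rgb, tir)
final_loss = mean_loss(student_w, teacher_w, pairs)

print(f"distillation loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

In this sketch the student sees only the TIR input and the teacher only the RGB input, which mirrors the pairing described in the abstract; a real tracker would replace the linear maps with the network branches and the synthetic pairs with aligned image patches.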
