Towards Accurate Visual and Natural Language-Based Vehicle Retrieval Systems

In this work, we consider two tracks of the 2021 NVIDIA AI City Challenge, the City-Scale Multi-Camera Vehicle Re-identification and Natural language-based Vehicle Retrieval. For the vehicle re-identification task, we employ the state-of-art Excited Vehicle Re-Identification deep representation learning model coupled with best training practices and domain adaptation techniques to obtain robust embeddings. We further refine the re-identification results through a series of post-processing steps to remove camera and vehicle orientation bias that is inherent in the task of re-identification. We also take advantage of multiple observations of a vehicle using track-level information and finally obtain fine-grained retrieval results. For the task of Natural language-based vehicle retrieval we leverage the recently proposed Contrastive Language-Image Pre-training model and propose a simple yet effective text-based vehicle retrieval system. We compare our performance against the top submissions to the challenge and our systems are ranked 8th in the public leaderboard for both tracks.

[1]  Ross B. Girshick,et al.  Mask R-CNN , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Kurt Keutzer,et al.  Visual Transformers: Token-based Image Representation and Processing for Computer Vision , 2020, ArXiv.

[3]  Stan Sclaroff,et al.  CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions , 2021, ArXiv.

[4]  Wei Jiang,et al.  Bag of Tricks and a Strong Baseline for Deep Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[5]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[7]  Rama Chellappa,et al.  The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification , 2020, ECCV.

[8]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[9]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Hao Wu,et al.  Mixed Precision Training , 2017, ICLR.

[11]  Yi Yang,et al.  Going Beyond Real Data: A Robust Visual Representation for Vehicle Re-identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[12]  Liang Zheng,et al.  Simulating Content Consistent Vehicle Datasets with Attribute Descent , 2019, ECCV.

[13]  Ellen M. Voorhees,et al.  The TREC-8 Question Answering Track Report , 1999, TREC.

[14]  Marcus Rohrbach,et al.  12-in-1: Multi-Task Vision and Language Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[16]  Vladlen Koltun,et al.  Tracking Objects as Points , 2020, ECCV.

[17]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[18]  Tiejun Huang,et al.  Deep Relative Distance Learning: Tell the Difference between Similar Vehicles , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Wei Jiang,et al.  SphereReID: Deep Hypersphere Manifold Embedding for Person Re-Identification , 2018, J. Vis. Commun. Image Represent..

[20]  Kihyuk Sohn,et al.  Improved Deep Metric Learning with Multi-class N-pair Loss Objective , 2016, NIPS.

[21]  Rama Chellappa,et al.  Attention Driven Vehicle Re-identification and Unsupervised Anomaly Detection for Traffic Understanding , 2019, CVPR Workshops.

[22]  Xiaogang Wang,et al.  Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Jianfeng Gao,et al.  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.

[24]  Ling-Yu Duan,et al.  VERI-Wild: A Large Dataset and a New Method for Vehicle Re-Identification in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yinghui Xu,et al.  Multiple Object Tracking with Correlation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Tao Mei,et al.  FastReID: A Pytorch Toolbox for General Instance Re-identification , 2020, ArXiv.

[27]  Jenq-Neng Hwang,et al.  CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Rama Chellappa,et al.  A Dual-Path Model With Adaptive Attention for Vehicle Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[30]  Wu Liu,et al.  Large-scale vehicle re-identification in urban surveillance videos , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[31]  Chenggang Yan,et al.  Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification , 2020, ACM Multimedia.

[32]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[33]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[34]  Pichao Wang,et al.  TransReID: Transformer-based Object Re-Identification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Yu Cheng,et al.  UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[36]  Harshad Rai,et al.  Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks , 2018 .

[37]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[38]  Giorgos Tolias,et al.  Fine-Tuning CNN Image Retrieval with No Human Annotation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Policies from Data , 2018, ArXiv.

[40]  Hanqing Lu,et al.  Learning Coarse-to-Fine Structured Feature Embedding for Vehicle Re-Identification , 2018, AAAI.