Keyword-based Vehicle Retrieval

A natural language-based vehicle retrieval system makes a city-scale traffic system easier to maintain and more adaptable to changing requirements. It provides a convenient means of managing traffic flows or detecting accidents involving a specific vehicle. Such a system differs from most query-based video retrieval systems because the vocabulary describing traffic situations and the objects visible in traffic video streams are both limited. Existing techniques for general language-based video retrieval measure the similarity between language representations and video representations. Our approach instead focuses on a small set of features that distinguish one vehicle from others: the proposed retrieval system defines these discriminative features and computes the similarity between queries and video frames based on them. The proposed technique placed third in the 2021 AI City Challenge.
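To make the query-to-frame matching concrete, the sketch below shows one way keyword-based similarity could be computed between a free-form query and tracked vehicles whose discrete attributes (color, vehicle type, motion direction) have already been assigned by upstream detectors. The attribute vocabulary, weights, and scoring function are illustrative assumptions, not the actual feature set or similarity measure used by the system described above.

```python
# Minimal sketch of keyword-based query-to-track matching.
# Vocabulary, weights, and scoring are assumptions for illustration only.
from dataclasses import dataclass

# Hypothetical keyword vocabulary grouped by attribute.
VOCAB = {
    "color": {"red", "blue", "black", "white", "gray", "silver"},
    "type": {"sedan", "suv", "pickup", "truck", "van", "hatchback"},
    "direction": {"left", "right", "straight"},
}

@dataclass
class Track:
    track_id: str
    color: str
    vtype: str
    direction: str

def extract_keywords(query: str) -> dict:
    """Pull known attribute keywords out of a free-form query string."""
    tokens = set(query.lower().replace(".", "").split())
    return {attr: words & tokens for attr, words in VOCAB.items()}

def score(track: Track, keywords: dict, weights=(1.0, 1.0, 0.5)) -> float:
    """Weighted count of attribute matches between a query and a vehicle track."""
    w_color, w_type, w_dir = weights
    s = 0.0
    if track.color in keywords["color"]:
        s += w_color
    if track.vtype in keywords["type"]:
        s += w_type
    if track.direction in keywords["direction"]:
        s += w_dir
    return s

def retrieve(query: str, tracks: list) -> list:
    """Rank candidate tracks by their keyword-match score (highest first)."""
    kw = extract_keywords(query)
    return sorted(tracks, key=lambda t: score(t, kw), reverse=True)

if __name__ == "__main__":
    tracks = [
        Track("t1", "red", "sedan", "left"),
        Track("t2", "white", "suv", "straight"),
        Track("t3", "red", "pickup", "straight"),
    ]
    ranked = retrieve("A red sedan turns left at the intersection.", tracks)
    print([t.track_id for t in ranked])  # t1 should rank first
```

In practice, the matching would operate on attributes predicted per frame or per track rather than hand-labeled fields, but the ranking-by-feature-similarity structure stays the same.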
