Connecting Language and Vision for Natural Language-Based Vehicle Retrieval
暂无分享,去创建一个
Chang Zhou | Zhedong Zheng | Hongxia Yang | Shuai Bai | Junyang Lin | Yi Yang | Zhu Zhang | Xiaohan Wang | Chang Zhou | Hongxia Yang | Zhedong Zheng | Yi Yang | Zhu Zhang | Junyang Lin | Xiaohan Wang | Shuai Bai
[1] Naokazu Yokoya,et al. Learning Joint Representations of Videos and Sentences with Web Image Search , 2016, ECCV Workshops.
[2] Xiaogang Wang,et al. Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[3] Luowei Zhou,et al. Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction , 2018, BMVC.
[4] Jongwook Choi,et al. Video Captioning and Retrieval Models with Semantic Attention , 2016, ArXiv.
[5] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[6] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[7] Boqing Gong,et al. Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Aleksandr Petiushko,et al. MDMMT: Multidomain Multimodal Transformer for Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[9] Jiebo Luo,et al. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language , 2019, AAAI.
[10] Andrew Zisserman,et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval , 2021, ArXiv.
[11] Ramakant Nevatia,et al. TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[12] Zhedong Zheng,et al. Dual-path Convolutional Image-Text Embeddings with Instance Loss , 2017, ACM Trans. Multim. Comput. Commun. Appl..
[13] Claude Coulombe,et al. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs , 2018, ArXiv.
[14] Yang Zhao,et al. Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Michael I. Jordan,et al. Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.
[16] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.
[17] Tao Xiang,et al. Multi-scale Deep Learning Architectures for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[18] Tatsuya Harada,et al. Spatio-Temporal Person Retrieval via Natural Language Queries , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[19] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[20] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[21] Kai Zou,et al. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks , 2019, EMNLP.
[22] Wei Chen,et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.
[23] Meng Liu,et al. Attentive Moment Retrieval in Videos , 2018, SIGIR.
[24] Yu Cheng,et al. Dual-Mode Vehicle Motion Pattern Learning for High Performance Road Traffic Anomaly Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[25] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Bowen Zhang,et al. Cross-Modal and Hierarchical Modeling of Video and Text , 2018, ECCV.
[27] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019 .
[28] Liang Zheng,et al. Improving Person Re-identification by Attribute and Identity Learning , 2017, Pattern Recognit..
[29] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[30] Qi Tian,et al. Beyond Part Models: Person Retrieval with Refined Part Pooling , 2017, ECCV.
[31] Yunchao Wei,et al. VehicleNet: Learning Robust Visual Representation for Vehicle Re-Identification , 2020, IEEE Transactions on Multimedia.
[32] Quoc V. Le,et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.
[33] Ivan Laptev,et al. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.
[34] Lin Ma,et al. Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video , 2019, ACL.
[35] Jonas Mueller,et al. Siamese Recurrent Architectures for Learning Sentence Similarity , 2016, AAAI.
[36] Qun Liu,et al. TinyBERT: Distilling BERT for Natural Language Understanding , 2020, EMNLP.
[37] Nan Duan,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[38] Diyi Yang,et al. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets , 2015, EMNLP.
[39] Xianyan Jia,et al. M6: A Chinese Multimodal Pretrainer , 2021, ArXiv.
[40] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Zhedong Zheng,et al. Joint Discriminative and Generative Learning for Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Xirong Li,et al. Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[43] Siddhant Garg,et al. BAE: BERT-based Adversarial Examples for Text Classification , 2020, EMNLP.
[44] Jongwook Choi,et al. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[46] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[47] Linchao Zhu,et al. T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Yann LeCun,et al. Barlow Twins: Self-Supervised Learning via Redundancy Reduction , 2021, ICML.
[49] Ling-Yu Duan,et al. VERI-Wild: A Large Dataset and a New Method for Vehicle Re-Identification in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Xiang Zhang,et al. Character-level Convolutional Networks for Text Classification , 2015, NIPS.
[51] Chen Sun,et al. Multi-modal Transformer for Video Retrieval , 2020, ECCV.
[52] Zhe Gan,et al. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training , 2020, EMNLP.
[53] Stan Sclaroff,et al. CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions , 2021, ArXiv.
[54] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[55] Tiejun Huang,et al. Deep Relative Distance Learning: Tell the Difference between Similar Vehicles , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[57] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Amit K. Roy-Chowdhury,et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.
[59] Larry S. Davis,et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Wei Wu,et al. Traffic Anomaly Detection via Perspective Map based on Spatial-temporal Information Matrix , 2019, CVPR Workshops.
[61] Hongyu Guo,et al. Augmenting Data with Mixup for Sentence Classification: An Empirical Study , 2019, ArXiv.
[62] Juan Carlos Niebles,et al. Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[63] Ellen M. Voorhees,et al. The TREC-8 Question Answering Track Report , 1999, TREC.
[64] Rico Sennrich,et al. Improving Neural Machine Translation Models with Monolingual Data , 2015, ACL.
[65] Quoc V. Le,et al. Unsupervised Data Augmentation for Consistency Training , 2019, NeurIPS.
[66] Guillaume Lample,et al. Unsupervised Machine Translation Using Monolingual Corpora Only , 2017, ICLR.
[67] Sanja Fidler,et al. Visual Semantic Search: Retrieving Videos via Complex Textual Queries , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[68] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[69] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[70] Tao Xiang,et al. The Devil is in the Middle: Exploiting Mid-level Representations for Cross-Domain Instance Matching , 2017, ArXiv.