Paper Information - Text-based Person Search without Parallel Image-Text Data

Text-based Person Search without Parallel Image-Text Data

Text-based person search (TBPS) aims to retrieve images of a target person from a large image gallery based on a given natural language description. Existing methods are dominated by models trained on parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data ($\mu$-TBPS), in which only non-parallel images and texts, or even image-only data, can be used. To this end, we propose a two-stage framework, generation-then-retrieval (GTR), which first generates a pseudo text for each image and then performs retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of the person image. It first uses a set of instruction prompts to activate an off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, and then converts the extracted attributes into a textual description via a finetuned large language model or a hand-crafted template. In the retrieval stage, because the generated texts are noisy and can interfere with model training, we develop a confidence score-based training scheme that lets more reliable texts contribute more during training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR achieves promising performance without relying on parallel image-text data.
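The two ideas in the abstract, template-based caption generation and confidence-weighted training, can be illustrated with a minimal sketch. The abstract does not give the template, the way confidence scores are computed, or the exact loss, so everything below is an assumption: the function names `attributes_to_caption` and `confidence_weighted_loss`, the attribute keys, and the choice of a per-sample weighted image-to-text cross-entropy are hypothetical and are not taken from the paper.

```python
import math


def attributes_to_caption(attrs):
    """Hypothetical hand-crafted template: join attribute phrases into one sentence.

    `attrs` maps assumed keys ("gender", "upper", "lower", "shoes") to short
    phrases, e.g. answers a vision-language model gave to instruction prompts.
    """
    subject = "A " + attrs.get("gender", "person")
    clothing = [attrs[k] for k in ("upper", "lower", "shoes") if k in attrs]
    if not clothing:
        return subject + "."
    return subject + " wearing " + " and ".join(clothing) + "."


def confidence_weighted_loss(sim, conf):
    """Assumed form of confidence-weighted training (not the paper's exact loss).

    sim[i][j] is the similarity between image i and pseudo text j, with the
    matched pair on the diagonal. conf[i] in [0, 1] is the confidence of the
    pseudo text for image i. Each pair's cross-entropy is scaled by its
    confidence, so more reliable texts contribute more to the average.
    """
    weight_sum = sum(conf)
    if weight_sum == 0:
        return 0.0
    total = 0.0
    for i, row in enumerate(sim):
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(row)
        log_denom = m + math.log(sum(math.exp(s - m) for s in row))
        nll = log_denom - row[i]  # negative log-likelihood of the matched text
        total += conf[i] * nll
    return total / weight_sum


caption = attributes_to_caption(
    {"gender": "woman", "upper": "a red jacket", "lower": "blue jeans"}
)
# Image 1 is poorly matched to its pseudo text; lowering that text's
# confidence reduces its influence on the loss.
sim = [[5.0, 0.0], [5.0, 0.0]]
loss_equal = confidence_weighted_loss(sim, [1.0, 1.0])
loss_downweighted = confidence_weighted_loss(sim, [1.0, 0.1])
```

In this toy setting the second pseudo text matches its image poorly, so assigning it a low confidence shrinks its share of the loss. How reliable such scores are in practice depends on how they are derived, which the abstract leaves unspecified.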
